ABSTRACT

SWALLOW is an experimental distributed data storage system that provides personal computers
with a uniform interface to their local data and the data stored in shared remote servers called
repositories. The SWALLOW repositories provide reliable, secure, and efficient long-term storage
for both very small and very large objects and support updating of a group of objects at one or
several repositories in a single atomic action. The repositories support, with some minor
modifications, the object model developed by Reed [REED 78].

The core of the repository is stable append-only storage called the Version Storage (VS). VS is the
only stable storage in the repository. It contains the histories of all objects in the repository and all
the information needed for crash recovery. It is assumed that VS will be implemented with
write-once storage devices such as optical disks. The upper 2^n words of VS are kept in the Online
Version Storage (OVS). Techniques similar to real-time garbage collection are used to keep the
current versions of frequently used objects in OVS. Two different policies for retaining current
versions of objects in OVS are investigated; the actual implementation further depends on the type
of storage devices used for OVS.

A critical concern addressed throughout the design of the repository is recovery from system crashes
and storage device failures. The crash recovery of the repositories is based entirely on the
information contained in VS; VS is scanned sequentially, starting from its current end, until all
object histories have been reconstructed. The recovery can be distributed over time, such that the
recovery process is invoked for one object at a time, as individual objects are accessed. The same
mechanism is used to recover commit records, which are data structures that record the state of
atomic actions and group together the objects to be updated in a single atomic action. The
implementation of commit records in the repository guarantees that all updates made by a specific
atomic action are either all completed or all undone, regardless of failures. Further, interrupted
atomic actions can be continued from the point of interruption, without any additional (backward)
recovery.


Keywords:  Distributed  systems,  atomic  actions,  storage  management,  reliability,  recovery. 
MANAGEMENT OF OBJECT HISTORIES IN THE SWALLOW REPOSITORY


SWALLOW is an experimental project that will test the feasibility of several advanced ideas on the
design of object-oriented distributed systems. Its purpose is to provide reliable, secure and efficient
storage in a distributed environment consisting of many personal machines and one or more shared
repositories. The objectives and the overall structure of SWALLOW are presented in [REED 80];
the major components of the SWALLOW system are shown again in Figure 1.

Each personal machine runs a subsystem called a broker that interacts with the manager of the local
storage device and the remote repositories; this broker implements a uniform interface to all objects
accessible from the personal computer. The repositories provide stable, reliable, long-term storage
for untyped objects. They must handle efficiently both very small and very large objects and
provide mechanisms for updating a group of objects at one or more physical nodes in a single
atomic action.

This report discusses the organization and management of the repositories in the SWALLOW
system. The repositories support, with some minor modifications, the object model developed by
Reed [REED 78]. This model provides the basis for synchronization and recovery in the
implementation of atomic actions. The main features of Reed's object model are outlined in
Section 1; however, the material presented in this report assumes a much deeper knowledge of
Reed's work.

1. Object model

An object can be viewed as a history of all the states assumed by the object since its creation. Each
distinguishable (abstract) state of an object is represented by a special immutable entity called a
version. In addition to having a value, a version has a time attribute that specifies its range of
validity. The range of validity of a particular version is the time interval in the history of the object
during which the object was known to be in the state represented by the version. Each version
delimits the range of validity of the preceding version. All operations on objects include an implicit
parameter, a pseudo-time, which specifies the exact point in the object's history to which this
operation refers. A read operation selects the version that has the highest "start time" that is lower
than the pseudo-time p specified in the read request. If the "end time" of that version is lower
than p, it is extended to p. A write operation first creates a token, which has to be explicitly
committed to become a version. The start time of that version is the pseudo-time specified in the
write request. A token can later be discarded, thus returning the object history to the state that
existed prior to the execution of the write operation.
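
The read and write rules above can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the repository's implementation: the names `Entry` and `ObjectHistory` are hypothetical, pseudo-times are modelled as integers, a read that would have to wait for a token is modelled as an exception, and a write whose pseudo-time falls inside an existing range of validity is simply rejected.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Entry:
    """One entry of an object history: a committed version or a token."""
    start: int               # pseudo-time given in the write request
    end: int                 # end of the range of validity (extended by reads)
    value: Any
    committed: bool = False  # False while the entry is still a token


@dataclass
class ObjectHistory:
    entries: List[Entry] = field(default_factory=list)

    def write(self, p: int, value: Any) -> Entry:
        """Create a token at pseudo-time p; it must be committed explicitly."""
        for e in self.entries:
            if e.start <= p <= e.end:
                # Assumption: writes into a known range of validity are refused.
                raise ValueError("pseudo-time lies inside an existing validity range")
        token = Entry(start=p, end=p, value=value)
        self.entries.append(token)
        return token

    def commit(self, token: Entry) -> None:
        """Turn a token into a version."""
        token.committed = True

    def discard(self, token: Entry) -> None:
        """Remove a token, returning the history to its prior state."""
        self.entries.remove(token)

    def read(self, p: int) -> Any:
        """Select the entry with the highest start time lower than p."""
        candidates = [e for e in self.entries if e.start < p]
        if not candidates:
            raise LookupError("no version starts before this pseudo-time")
        e = max(candidates, key=lambda c: c.start)
        if not e.committed:
            # A real repository would wait for commit or abort of the token.
            raise RuntimeError("read would wait: token not yet committed")
        if e.end < p:
            e.end = p  # extend the range of validity up to p
        return e.value
```

Note that `read` mutates the selected entry. This is exactly the property that Section 1.1 identifies as a problem for storing versions as immutable entities.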


Figure 1: Structure of the SWALLOW system
The object model supports construction of atomic actions. An atomic action is a control abstraction
that guarantees the following:

i. atomic actions are mutually exclusive, that is, operations performed as part of one atomic
action cannot see or interfere with the tokens created within a different atomic action, and

ii. the tokens created as part of the same atomic action are either all committed (converted
into versions) or all aborted (removed from the object histories).

Associated with an atomic action is a pseudo-temporal environment and a possibility. All operations
performed within an atomic action are assigned pseudo-times from the same pseudo-temporal
environment; the pseudo-temporal environment is a mechanism for making atomic actions mutually
exclusive. A possibility is a group of tokens created by a specific atomic action. The possibility
mechanism guarantees that only the atomic action that created the tokens can read them and that
the tokens are either all committed or all aborted.

Possibilities are represented by commit records. A commit record is a data structure that records
the state of a possibility and keeps track of what entities are dependent on the outcome of the
possibility. A commit record is created with the possibility state set to unknown. When an atomic
action completes successfully, the possibility that represents it is committed and the possibility state
in the commit record is set to committed. If the atomic action is aborted, the possibility state in the
commit record becomes aborted. The commit record includes a list of references to tokens created
by the atomic action. Also, each token contains a reference to its commit record.
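
The life cycle of a commit record can be sketched as a small state machine. The class and method names below are hypothetical. Two details are assumptions of this sketch, not statements of the report: repeating the same outcome is accepted as a no-op, and an attempt to reverse an outcome is rejected.

```python
from enum import Enum
from typing import List


class State(Enum):
    UNKNOWN = "unknown"
    COMMITTED = "committed"
    ABORTED = "aborted"


class CommitRecord:
    """Records the state of one possibility and the tokens that depend on it."""

    def __init__(self) -> None:
        self.state = State.UNKNOWN        # every commit record starts as unknown
        self.token_refs: List[str] = []   # references to tokens of this action

    def add_token(self, ref: str) -> None:
        if self.state is not State.UNKNOWN:
            raise RuntimeError("possibility already resolved")
        self.token_refs.append(ref)

    def commit(self) -> None:
        self._resolve(State.COMMITTED)

    def abort(self) -> None:
        self._resolve(State.ABORTED)

    def _resolve(self, outcome: State) -> None:
        if self.state is State.UNKNOWN:
            self.state = outcome
        elif self.state is not outcome:
            # Assumption: once resolved, the outcome can never be changed.
            raise RuntimeError("possibility already resolved differently")
```

Because all tokens of an atomic action refer to the same record, a single state change decides the fate of all of them at once.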

Construction of atomic actions is controlled by the brokers. This includes generation of the
pseudo-temporal environment for atomic actions and creation and commitment or abortion of possibilities.
The tokens in the same possibility can be created by different brokers; thus the commit records are
shared data structures and must be in some repository. The repositories therefore must implement
two abstractions: the object histories and the commit records. The operations that can be requested
by the brokers to be performed by the repositories are listed below. (Although the requests are shown in
the form of procedure calls, this does not imply that a remote procedure call type of protocol will be used [LAMP 79].
Also, the lists of parameters as shown are not necessarily complete. Specifically, instead of a general acknowledgement,
the repository will return enough information about the request and its result to make the response self-identifying. If
the requested operation cannot be performed, the repository returns an error message.)


Requests  that  pertain  to  object  histories: 

create (pseudo-time, commit-record-id) returns (object-id)
read (object-id, pseudo-time, commit-record-id) returns (value)
create-token (object-id, pseudo-time, commit-record-id, value) returns (ack)
delete (object-id, pseudo-time, commit-record-id) returns (ack)


Requests  that  pertain  to  commit  records: 

create (timeout) returns (ack)
test (commit-record-id) returns (commit-record-state)
commit (commit-record-id) returns (ack)
abort (commit-record-id) returns (ack)

Additional operations on commit records must be supported in order to implement possibilities that
involve objects in more than one repository (distributed possibilities); these operations, which can be
requested only by a repository, will be discussed in Section 5.

1.1 Representation of object histories

In Reed's original model, there may be time intervals in the object history that do not have
corresponding versions (Figure 2). A new version can be created belatedly in any such time interval
(by creating and committing a token), or the interval can be diminished when a request to read the
value of the object at a time point that falls within this interval is executed. The latter action
extends the validity range of the immediately preceding version, up to (and including) the pseudo-time
of the read request. Both of these forms of "eduction" have to be accommodated in the object history
representation.

Figure 3a shows a linked list representation where the range of validity and the state of the version
(token/committed) is physically a part of each version representation [REED 78, REED 79]. An
alternative representation is to concentrate the various information about versions, including the
pointers to the actual values, in a separate data structure which becomes a part of the object header
(Figure 3b). The main problem with the first scheme is that the entities that represent versions are
not immutable. The range of validity changes as versions are read. Also, if a new version is inserted
into a gap, the "next version" link of the version that follows the new one in time must be changed.
Similarly, if an action that produced a token is aborted, the token must be discarded, that is, the
token must be removed from the history by destroying the pointer to the token. Another
disadvantage is that if an operation refers to an older part of the history, it is necessary to inspect all
newer versions to find the appropriate version (or gap). The other scheme (b) leads to more
complicated storage management. The size of the object header varies from object to object and
changes as new versions are created; also, since it must be possible to insert new entries anywhere
in the version list, a simple array representation is not possible. Second, the number of versions in
Figure 2: An example of an object history.

Token X2 was created after version V? and token X1. Version V? was committed recently, but has
not had its state encached yet. Reading the object at time t1 will return the value of version V1.
Reading the object at time t2 will return the value of the version valid at that time, after extending
the validity (end time) of this version to t2. Attempts to read the object at times t3 and t4 will result
in a wait, pending commitment or abortion of tokens X2 and X1 respectively, unless the read operation is
requested from within the same possibility under which the token was created.


Figure 3: Possible representations of known object histories, shown for the example given in
Figure 2: (a) linked list of versions; (b) version information concentrated in the object header.


an  object  history  may  grow  very  large,  and  old  versions  must  be  removed  from  online  storage.  If 
the  stored  versions  physically  contain  the  validity  range  and  linking  information,  this  information 
will  be  purged  from  online  storage  automatically  with  the  old  versions.  If  the  list  of  version 
references  is  kept  in  the  object  header,  it  may  have  to  be  pruned  separately. 

It  is  highly  desirable  to  represent  versions  by  immutable  storage  entities.  Perhaps  the  strongest 
reason  for  this  restriction  is  that  it  is  much  simpler  to  design  mechanisms  to  ensure  integrity  of 
stored  versions. 

One of the main functions of the repository is to provide very reliable storage. This means that the
physical storage must be stable, that is, the information stored in it must not decay over time. In
addition, it is necessary to ensure that information written to it is either written completely and
correctly or not at all, that is, that the operations on stable storage are atomic. Since no physical
device provides storage with these properties, the atomic stable storage must be implemented as an
abstraction, using hardware components with less desirable properties. In particular, atomic stable
storage must be designed to tolerate processor crashes during write operations and decays of the
storage media. This is accomplished by writing the data twice, into decay-independent sets [LAMP
79].

The operation that is most difficult to perform atomically is an in-place update of stored information.
An atomic update means that either the content of the updated entity is changed into the new
value or, if the operation fails, the value of this entity is left unchanged. That is, atomicity
guarantees that a stored entity is never left in an inconsistent state where the old value has been lost
and the new value is incorrect. To perform an atomic update, the two copies of stored information
in the decay-independent sets must be changed strictly sequentially, i.e. the first write must
complete successfully (correct data written to correct address) before the second write is initiated. If
the storage model does not have to support an update operation, the problem of atomicity is
simplified. It is still necessary to have two copies for stability, and the ability to detect and correct
bad writes, but the two writes into the two decay-independent sets can be done concurrently.
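
The strictly sequential update of two copies can be sketched as follows. This is a simplified illustration, not the mechanism of [LAMP 79] itself: the two decay-independent sets are modelled as in-memory dictionaries, a bad write is detected with a CRC-32 checksum, and the names `StablePair`, `update` and `read` are hypothetical.

```python
import zlib
from typing import Dict, Optional, Tuple

Record = Tuple[int, bytes]  # (checksum, data)


class StablePair:
    """Two copies of every record, standing in for two decay-independent sets."""

    def __init__(self) -> None:
        self.copy_a: Dict[int, Record] = {}
        self.copy_b: Dict[int, Record] = {}

    @staticmethod
    def _seal(data: bytes) -> Record:
        return (zlib.crc32(data), data)

    @staticmethod
    def _good(rec: Optional[Record]) -> bool:
        return rec is not None and zlib.crc32(rec[1]) == rec[0]

    def update(self, addr: int, data: bytes) -> None:
        """In-place update: the second write starts only after the first is verified."""
        rec = self._seal(data)
        self.copy_a[addr] = rec
        if not self._good(self.copy_a.get(addr)):
            raise OSError("first write failed; second copy left untouched")
        self.copy_b[addr] = rec

    def read(self, addr: int) -> bytes:
        """Return a good copy and repair the other one if they disagree."""
        a, b = self.copy_a.get(addr), self.copy_b.get(addr)
        if self._good(a):
            if a != b:
                self.copy_b[addr] = a  # crash hit between the two writes
            return a[1]
        if self._good(b):
            self.copy_a[addr] = b      # first copy decayed or was half-written
            return b[1]
        raise OSError("both copies are bad")
```

At any moment at least one copy holds either the complete old value or the complete new value, which is why the writes of an update must not overlap. For write-once entities there is no old value to protect, so the two writes may proceed concurrently.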

A second strong motivation for choosing an immutable representation for object versions and tokens
is the possibility of using optical disks, which are write-once storage. The given object model will
require a large amount of storage. Thus, it is important to utilize storage devices that are 1)
inexpensive and 2) easy to store offline. To provide fast access to old versions, a random access device
is needed. Optical disks look promising in all these aspects.

To  satisfy  the  immutability  requirement  with  the  present  object  model,  it  would  be  necessary  to  use 
the  scheme  of  Figure  3b.  However,  it  will  be  shown  that  with  a  minor  modification  to  the 
conceptual object model it is possible (and better) to include most information about versions in the
version  representation. 

1.2  Modified  object  model 

If we allow insertion of new versions in an arbitrary place in the list, the information about the
ordering of the existing versions (the physical pointers to stored versions) must be kept in storage
that allows multiple (unlimited) writes. In addition, the "end time" information for each version
has to be kept in such storage, since it must be changed when a version is to be read at a
pseudo-time greater than the current end time. Another possibility would be to completely rewrite each
version every time its end time must be extended and when a new version is inserted after it, but
such a scheme does not seem practical.

Let us constrain the conceptual model such that when a new version is created, the end time of the
previous version is extended to close the gap. This means that new versions can be inserted only at
the "current" end of the list. Also, each object can have at most one token. (Actually, an object could
have multiple "dependent" tokens at the "current" end, as is done in Takagi's scheme [TAKA 79]. This possibility will
not be investigated in this report.) With the exception of the current (latest) version and the
token, the end time of a version can be derived from the start time of the next newer version and
thus does not have to be included in the version representation. Consequently, an object history
can be represented by a fixed-size object header and a growing list of immutable entities that
represent the versions.

The data structures needed to represent an object history are shown in Figure 4. The object header
contains a reference to the current version of the object and the end time of the current version.
This time is updated every time the current version is read past its end time. The object header also
includes a token reference that is either null, if the object does not have a token, or contains the
physical address of the current token. One reason for including both the current version reference
and the token reference in the object header is that it is simpler to discard a token (remove it from
the object history) when the atomic action that created it is aborted. However, having both of these
references in the object header is crucial to the storage management, as will be seen later. Tokens
can be read from within the atomic actions that created them; each such read extends the end time
of this future version. Since the end time of the current version should not be automatically
extended up to the start time of the token until that token is actually committed, it is necessary to
keep track of the end time of the tokens as well as the end time of the current versions. (It should be
kept in mind that the current version end time and token end time in the object header are pseudo-times that do not
necessarily correspond to real time.) Finally, a reference to the commit record for the current token is
Figure 4: Representation of the object history for the modified object model;
it  is  not  possible  to  create  token  X2  in  this  model. 
contained in the object header, although this is only an optimization, since this information is
present  also  in  the  token. 

The data structures that represent the versions are called version images. A version image contains,
in addition to the "value" field, the "start time" ts, a reference to the immediately preceding
version, the uid of the object it represents and a reference to the commit record for this version.
The last two items are needed for recovery, as will be explained later. The time ts specifies the
beginning of the time interval in the object's history represented by that version. (Again, ts is not the
real time when the version image was created, but the pseudo-time specified in the request to create a token.)

It is important to make a distinction between versions and the representation of versions, that is, the
version images. A version is a logical concept; it is the value of the object during a specific interval
in the object's history. A version image represents either a version or a token; to determine which
of these two it represents, it is necessary to inspect the object header or the commit record specified
in the version image. Several copies of a version image may coexist in the repository. (Since versions are
immutable, this does not cause any synchronization problem.) Also, a version image may remain in the
repository although it no longer represents a valid version. Thus to discard a token when the action
that created it is aborted, it is sufficient to set the token reference field in the object header to null.

In addition to eliminating the need to include mutable data structures in the version representation,
the modified model also eliminates the need to perform a write operation when an older version is
read. The lost ability to leave regions of the object's history undefined and create versions in such
regions later does not significantly reduce the power of the object model. In most situations, an
object that is to be updated is read first, and it is desirable to extend the end time of the read
version up to the start time of the new version to ensure that the object has not been changed after
it was read.
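
A compact sketch of the modified model follows: a fixed-size, mutable object header and a chain of immutable version images. It is only an illustration under several assumptions. VS is modelled as a Python list whose indices serve as VS addresses, reads from inside the creating atomic action are omitted, a read that would have to wait for a token raises an exception, and all class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class VersionImage:
    """Immutable representation of a version or a token."""
    object_id: str
    start_time: int           # pseudo-time from the create-token request
    value: Any
    prev: Optional[int]       # address of the immediately preceding version
    commit_record_id: str


@dataclass
class ObjectHeader:
    """Fixed-size, mutable part of an object history."""
    current: Optional[int] = None   # address of the current version
    current_end: int = 0            # end time of the current version
    token: Optional[int] = None     # address of the token, or None
    token_end: int = 0              # end time of the token
    token_commit_record: Optional[str] = None


class Repository:
    def __init__(self) -> None:
        self.vs: List[VersionImage] = []          # append-only storage
        self.headers: Dict[str, ObjectHeader] = {}

    def create_token(self, oid: str, p: int, cr: str, value: Any) -> None:
        h = self.headers.setdefault(oid, ObjectHeader())
        if h.token is not None:
            raise RuntimeError("an object can have at most one token")
        if h.current is not None and p <= h.current_end:
            raise ValueError("versions can be added only at the current end")
        self.vs.append(VersionImage(oid, p, value, h.current, cr))
        h.token, h.token_end, h.token_commit_record = len(self.vs) - 1, p, cr

    def commit(self, oid: str) -> None:
        h = self.headers[oid]
        if h.token is None:
            raise RuntimeError("no token to commit")
        h.current, h.current_end = h.token, h.token_end
        h.token, h.token_commit_record = None, None

    def abort(self, oid: str) -> None:
        # The version image stays in VS; only the header reference is dropped.
        h = self.headers[oid]
        h.token, h.token_commit_record = None, None

    def read(self, oid: str, p: int) -> Any:
        h = self.headers[oid]
        if h.token is not None and p > self.vs[h.token].start_time:
            raise RuntimeError("read would wait for the token to be resolved")
        addr = h.current
        while addr is not None and self.vs[addr].start_time >= p:
            addr = self.vs[addr].prev   # walk back to older versions
        if addr is None:
            raise LookupError("no version starts before this pseudo-time")
        if addr == h.current and p > h.current_end:
            h.current_end = p           # only the header is ever updated
        return self.vs[addr].value
```

Reading an older version touches no stored data at all, because its end time is implied by the start time of its successor. Aborting a token likewise changes only the header.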

1.3  Implementation  issues 

A crucial problem is to find an efficient and reliable scheme for mapping object histories into
physical storage. The two structures used to implement object histories, the object header and the
list of version images, require different models of storage and different management policies.
Object headers are mutable and therefore must be kept in storage that allows modifications of
stored information. The version images are immutable and thus can be stored in write-once storage.
In addition, the reliability requirements are different.

The main issue in the implementation of the lists of versions is storage allocation and management.
Giving each object a section of consecutive physical storage locations for its entire history is clearly
infeasible. Rather, it seems natural to view the version storage as a history of creation and updates
of all the objects in the repository. Section 2 develops a model of the version storage as an infinite
append-only file. Since it is infeasible to keep the entire version storage online, the online portion
of the version storage must be "reusable", that is, it must be possible to free it for newer version
images. This problem is studied in more depth in Section 3. That section also addresses the
problem of the assignment and management of the physical storage devices used to implement VS.

The role and management of object headers is discussed in Section 4. It is too expensive to
immediately reflect all changes to an object header in stable storage. Therefore, the object headers
are viewed only as hints that may be destroyed by a processor or storage device failure, but are
reconstructable from the information contained in the version images. That section also addresses
how objects are located and how concurrent requests for the same object are synchronized.

Section 5 discusses the implementation and management of commit records. Commit records are
special data types provided by the repository, but are ultimately mapped into the same object model
as other data. For possibilities that include objects in more than one repository, commit record
representatives are added to the model.

Recovery  issues  are  addressed  throughout  this  report,  but  the  major  step,  the  reconstruction  of 
object  headers,  is  described  in  Section  6.  Finally,  Section  7  presents  a  summary,  including  a  list  of 
issues  that  must  be  studied  in  more  depth. 


2.  Version  Storage 


The core of the repository is the Version Storage (VS). Abstractly, VS is an infinite append-only
tape. VS stores information as stable immutable entities. These entities will be called VS images.
A VS image consists of two fields: the data field, which at this level is simply an uninterpreted
sequence of bits, and the size field. VS is the only stable storage in the repository. It will contain
all versions of all objects in the repository. In addition, all the information needed for a crash
recovery must be stored in VS, as immutable VS images.

Version images, as described in Section 1.2, are contained in the data field of VS images. That is,
for storage in VS, an envelope that contains the size field is added (Figure 5). The version
references in individual VS images as well as the current version reference and the token reference
in the object header are directly the addresses of the representing VS images in VS, A_vi. The lists
of versions representing histories of different objects are intertwined in VS; their ordering in VS is
determined by the relative frequencies of updates of individual objects. (And to some extent also by read
activities, as will be seen later.)
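
The abstract view of VS as an append-only sequence of VS images, each addressed by its position, can be sketched as follows. The layout is an assumption made for illustration only: the report's Figure 5 is not reproduced here, so placing a 4-byte size field in front of the data field, and the names `VersionStorage`, `append` and `read`, are hypothetical.

```python
import struct


class VersionStorage:
    """Append-only storage of VS images: a size field plus uninterpreted data."""

    _SIZE_FIELD = struct.Struct(">I")  # assumed: 4-byte big-endian size field

    def __init__(self) -> None:
        self._tape = bytearray()

    @property
    def end(self) -> int:
        """Address of the current end of VS, where the next image will go."""
        return len(self._tape)

    def append(self, data: bytes) -> int:
        """Wrap data in an envelope, append it, and return its VS address."""
        addr = self.end
        self._tape += self._SIZE_FIELD.pack(len(data)) + data
        return addr

    def read(self, addr: int) -> bytes:
        """Return the data field of the VS image stored at addr."""
        (size,) = self._SIZE_FIELD.unpack_from(self._tape, addr)
        start = addr + self._SIZE_FIELD.size
        return bytes(self._tape[start:start + size])
```

The address returned by `append` is what an object header or a "preceding version" reference would hold. Images of different objects end up interleaved simply because they are appended in the order in which they are created.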

Since VS may grow arbitrarily large, it is infeasible to keep it online in its entirety. The issues of
what information should be kept online and how the online storage is to be managed are discussed
in Section 2.1. Section 2.2 is concerned with the transfer of data between the primary memory and
VS. Small objects (version images of small objects) must be packed into buffers while large objects
have to be partitioned into smaller pieces. Finally, Section 2.3 discusses some problems with the
mapping of the VS address space into the physical address spaces of the storage devices used.

2.1  Online  Version  Storage 

Only a fraction of the information contained in VS can be made available online. One approach is
to add a special kind of cache for the current versions of all objects. The most straightforward
policy for controlling the use of such a cache is to replace (overwrite) the version in the cache when
a new version of that object is created. However, this new version may never be committed; when
it is written into the cache, it is only a token. Alternatively, the cache could be assigned to contain
the latest committed version of each object and the tokens. When a token is committed, the other,
now old, version would be deleted and the freed space reused. Since version images can vary
greatly in size, the cache storage would become fragmented and it would be necessary to do
recompaction or garbage collection. This problem arises even if tokens are allowed to overwrite the
committed versions in the cache, since subsequent versions of an object can have greatly different
sizes. Another unpleasant aspect of this form of caching is that there is no easy way to deduce the
location of a version image in the cache from its address in VS and vice versa; thus two addresses
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Figure  5:  VS  image  representing  a  version  image. 
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have  to  be  remembered  for  each  version  image  in  the  cache. 


Instead of using a cache, the Online Version Storage (OVS), that is, the portion of VS currently
available online, will be the most recent 2^n words of VS. OVS will be implemented as a circular
buffer, as illustrated in Figure 6. Mark M_E will be used to specify the current end of VS on the
device that serves as OVS. New version images are always created in OVS, but for read requests, it
is necessary to determine if an image of the specified version exists in OVS. Such a check is very
simple: if (A_E - A_vi) <= 2^n, where A_E is the VS address of M_E and A_vi is the VS address of
the version image, then the version image is in OVS, and its address in OVS is (A_vi mod 2^n).
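
The residency check just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ovs_lookup` and its parameters are hypothetical, VS addresses are modelled as plain integers counted in words, and the boundary condition follows the inequality exactly as given in the text.

```python
def ovs_lookup(a_vi, a_e, n):
    """Map a VS address to an OVS address, or return None if the image is offline.

    a_vi: VS address of the version image (hypothetical integer model)
    a_e:  VS address of the end mark M_E
    n:    OVS holds the most recent 2**n words of VS
    """
    if a_vi > a_e:
        raise ValueError("address lies beyond the current end of VS")
    size = 1 << n
    if a_e - a_vi <= size:
        # The image is still online; the circular buffer position is the
        # VS address reduced modulo the OVS size.
        return a_vi % size
    return None


# Example: a 16-word OVS whose end mark sits at VS address 100.
print(ovs_lookup(90, 100, 4))   # image is online
print(ovs_lookup(80, 100, 4))   # image has left OVS
```

Because the OVS address is a pure function of the VS address, only one address per version image has to be remembered, which is the advantage over the cache scheme discussed above.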

OVS shall contain the version images created during the interval (t_c - T, t_c>, where t_c is the
current time and T is determined by the speed with which the available online version storage fills up.
Unfortunately, since versions of different objects are created at different rates, even the current
versions of some objects may disappear from OVS. To make sure that all or some objects (for
example, those objects that are read frequently) retain their current versions in OVS, it is necessary
to copy version images in OVS, and consequently in VS.

To preserve the current versions of objects in OVS, it is not sufficient to copy just the immediate
current versions when the time comes to reuse the respective fragment of OVS space: the tokens
have to be copied too. But, if an object has a token at the time the latest image of the current
version is to disappear from OVS, it is still necessary to copy the current version, since the token
may later be aborted.

When an image of a current version or a token is copied, the appropriate reference in the object
header must be changed. But if an object has a token, a reference to the current version appears
not only in the object header, but also in the token. Since the tokens are to be immutable, the
reference to the current version embedded in the token cannot be changed: it will always refer to
the version image that represented the current version at the time when the token was created.
Fortunately, the fact that the reference in the token is not modified does not lead to an error. If the
token becomes a version image, the reference to the copied version, which existed only in the object
header, is replaced by the reference to the version image of the former token. The copied version
image in OVS is effectively lost, but the object does have its current version in OVS. If the token is
aborted, the current version is found in OVS as it should be.

To summarize, as a consequence of the copying, VS may contain many version images that
represent the same version, but only one of these images is accessible by following the chain of
pointers in the object history. The other images use up storage, but do not have an adverse impact
on the implementation of the object histories.
Figure 6: OVS as a circular buffer.

A more detailed model of OVS will be presented in Section 3. Two different policies for retaining
version images in OVS will be investigated: one policy is to keep the current versions of all objects
in OVS; the other is to keep in OVS only the current versions of those objects that have been used
in the recent past. The actual implementation of these policies depends further on the type of
storage devices used.

2.2  Transfer  of  data  between  primary  memory  and  stable  VS 

The repository has to handle efficiently objects of greatly varying size, from very small ones (< 100
bytes) to very large ones (> 100 Kbytes). It would be very expensive to write small version images
into VS individually. Because of the constraints of the communication network and protocols, very
large objects will be sent to the repository in pieces; it would be very expensive, if not impossible, to
buffer very large objects in primary memory.

Thus, prior to creating new versions of objects in VS, it is necessary to:

1. pack small version images (tokens) before writing them to VS

2. fragment large objects before writing them to VS.

For easier management of VS (mainly for faster VS address resolution and object location), it is
desirable to allocate VS in fixed-sized blocks. These fixed-sized blocks, or pages, are the units of
atomic write into VS. Both the packing and fragmentation must take this into consideration.

2.2.1 Packing of version images in VS buffers

Let us first look at the packing problem. Basically, as tokens for new versions are created, their
version images are placed into a buffer in main memory. This buffer consists of one or more pages.
When a buffer page is full, it is written atomically into VS. However, there are two problems with
this scenario. First, creation of a token is a commitment that, regardless of processor, memory, or
device failures, if and when the possibility under which the token was created is committed, the
token is in the repository, undamaged. Thus the creation of a token cannot be acknowledged until
the token has been written into stable VS. This action is delayed by the packing process; since new
tokens will not be created at a constant rate, it may occasionally take a long time to fill up a
page. Thus, a timeout should be associated with each buffer page; if a buffer page is not filled up
before the timeout, it is written into stable storage partially empty. The filling of the buffer is sped
up by the copying process, which creates copies of old current versions and tokens at the "high" end
of VS; these copies again are first written into the buffer.


The second problem is what to do if a version image just created or copied does not fit into the
space remaining in the buffer page. Or, restated, the question is whether a version image should be
allowed to cross a page boundary. Although such a provision would lead to better storage
utilization and a possibility to deal more flexibly with large objects, there are strong reasons for not
permitting it. Once split version images are permitted, almost every page will end with a split
image, unless some restrictions are imposed on how version images can be split. A read
operation on a split image requires more than one VS access (two VS accesses if the maximum
permitted size of a version image is one page). Also, crash recovery would be slightly more
complicated: since the repository may crash between the writes that involve a split image, the
recovery algorithm would have to detect that the image is incomplete. The last consideration is that
the buffer pages that contain parts of a split image have to be mapped sequentially into the VS
address space. The alternative scheme described next will demonstrate the advantage of the lack of
this restriction.

If split version images are not allowed, it does not mean that the buffer pages have to be written
into VS half empty. As already indicated, the buffer in the main memory may consist of several
pages, or, better, at any time, there may be several one-page buffers for VS in the main memory, as
shown in Figure 7. The timeout for each buffer is set when the first version image is placed into
that buffer. Now, new version images can be placed into any of the existing buffers, or, if no
buffer offers enough space, a new buffer may be created, subject to a limit on the number of
buffers allowed. If no more buffers may be created, one must be written into VS before the new
version image can be placed. Since no ordering (precedence constraints) exists among the buffers,
they can be written into VS in any order. Thus the VS manager may select the buffer which is
most full, or the one which is closest to its timeout. That buffer is then assigned the next sequential
VS page address (this means that the actual VS address of a version image is not known until the
containing page is written into VS). The timeout associated with each buffer guarantees that no
buffer will wait forever for a version image of the "right" size.
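
The multi-buffer packing scheme can be illustrated with a short Python sketch. It is a minimal model under several assumptions that are not part of the original design: the names `PageBuffer`, `VSPacker`, `place` and `tick` are hypothetical, version images are byte strings, time is passed in explicitly, the first buffer with enough room is used, and the buffer chosen for eviction is always the fullest one (the text also allows choosing the one closest to its timeout).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class PageBuffer:
    deadline: float                      # set when the first image is placed
    used: int = 0
    images: list = field(default_factory=list)


class VSPacker:
    def __init__(self, page_size, max_buffers, timeout):
        self.page_size = page_size
        self.max_buffers = max_buffers
        self.timeout = timeout
        self.buffers = []                # one-page buffers in main memory
        self.written = []                # stands in for VS: (page address, images)
        self.next_page = 0

    def _flush(self, buf):
        # The VS page address is assigned only now, at write time.
        self.buffers.remove(buf)
        self.written.append((self.next_page, buf.images))
        self.next_page += 1

    def place(self, image, now):
        if len(image) > self.page_size:
            raise ValueError("image exceeds one page and must be fragmented")
        for buf in self.buffers:
            if self.page_size - buf.used >= len(image):
                break                    # first buffer with enough room
        else:
            if len(self.buffers) >= self.max_buffers:
                # No room for another buffer: write out the fullest one.
                self._flush(max(self.buffers, key=lambda b: b.used))
            buf = PageBuffer(deadline=now + self.timeout)
            self.buffers.append(buf)
        buf.images.append(image)
        buf.used += len(image)
        if buf.used == self.page_size:
            self._flush(buf)

    def tick(self, now):
        # Write out, partially empty, every buffer whose timeout has expired.
        for buf in [b for b in self.buffers if b.deadline <= now]:
            self._flush(buf)
```

Since version images never cross a page boundary, the buffers carry no precedence constraints and `_flush` is free to write them in any order.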

2.2.2  Partitioning  of  large  objects 

Large objects are partitioned invisibly to the brokers. However, this partitioning is not performed
solely by the repository, but starts at the level of the communication protocols, since the amount of
data that can be sent in a single packet is limited. If this amount is less than or equal to the page
size in the repository, no further partitioning is needed; otherwise the data received in individual
packets must be further divided. In either case, the fragments of an object (token) received in
different packets can be processed and written into VS as they arrive; each fragment will become a
separate version image. Since this partitioning is invisible to the brokers, a broker must always read
or write the whole object, i.e., it is not possible to retrieve or to update only a small portion. This
means that it should be sufficient to chain together the fragments of such an object and let the
object header point to the last fragment; it is not necessary to have random access to the individual
fragments. Unfortunately, if a version image that represents such a fragment of an object is copied
by the OVS manager, it would be necessary to modify a pointer in the version image that represents
the next piece, but this is impossible since the version images are immutable. On the other hand,
since the whole object (object version) will be read, all fragments should be copied, and the
embedded pointers can be modified as each fragment is copied. However, although the object
header must point to the last fragment, the copying must start with the first fragment, otherwise the
new VS addresses of the individual fragments cannot be determined. Actually, this also impacts the
initial creation of a version of a partitioned object. A version image of a piece k cannot be created
until the VS address of the version image of the fragment k-1 is known; this again imposes
precedence constraints on the set of buffers for VS.

Figure 7: Writing VS image into VS.

Images i_k are packed in one-page buffers; k specifies the order in which they were created. Since
buffer 2 is full, it is written into VS (via OVS) before buffer 1.

To overcome these problems, it is necessary to have a special pointer array. There are several
reasons for not including this pointer array in the object header: as will be seen in Section 4, the
entire object header must be reconstructable from the information stored in VS and therefore the
images of the individual fragments would have to include additional information; object headers
would have different sizes, and the size of a particular object header could vary over its lifetime;
but the most serious problem is that this would necessitate reconsideration of how to represent
object histories. What would be the meaning of the "previous version" reference in each version
image? Different versions of an object can be partitioned in different ways, so there is no
meaningful mapping between fragment k of one version and fragment k of the preceding version.

Thus the pointer array will be stored in VS. In fact, it will look like a version image. This does
not require any changes to the object header: the current version reference and the token reference
simply point to images that contain the appropriate pointer arrays, as do the "previous version"
pointers in each version image. A version image constructed in this way will be called a structured
version image. The individual fragments referred to through this pointer array can be of different
sizes. Both the VS image that contains the pointer array and the images of the individual fragments
will be packed in VS buffers as before.

Both for normal operations on objects and for recovery, the information whether a version is simple
(represented by a single version image) or structured must be included in the version images
themselves. It does not make sense, though, to propagate this distinction into the definition of an
object, since the representation may change during the object's lifetime: as an object changes size,
individual versions may be either simple or structured. This can also happen because of changes in
the lower level communication protocols (flow control). Also, it is superfluous to include all the
information so far associated with all version images in those images that represent the individual
fragments of a structured object version. In fact, none of these fields is needed! Thus for
representation of object versions and tokens, the repository should provide three distinct types of
stable entities:


simple version image:      self-identifying;
                           data field contains the actual data

structured version image:  self-identifying;
                           data field contains an array of pointers
                           to data images

data image:                interpretable only in the context of
                           the appropriate structured version image;
                           not used during recovery.

Figure 8 shows a fraction of an object history that uses both simple and structured version images,
and consequently all three types of stable entities just described. However, these distinct entities
should be supported on a higher level of abstraction than VS: the stability is assured by mapping
them into the same uninterpreted stable VS images.

Use of structured version images does not impose any precedence constraints on the transfer of
main memory buffers to VS. Of course, the header of a structured version image cannot be created
until all data images of that version have been written into VS, since the VS addresses are not
known until then. If such a version image is copied by the OVS manager, it is necessary to create a
new header after all data images have been copied. Structured version images are substantially
more expensive than simple version images, thus fragmentation should be used only when
necessary.
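
The relationship between the three kinds of stable entities can be sketched as follows. This is a hedged illustration, not the repository's actual layout: `VersionStorage`, `write_version` and `read_version` are hypothetical names, images are modelled as dictionaries in an append-only list, header overhead is ignored when comparing against the page size, and of the type codes only "2 = data image" is taken from Figure 8 (the codes 0 and 1 are assumptions).

```python
SIMPLE, STRUCTURED, DATA = 0, 1, 2   # only DATA = 2 appears in Figure 8


class VersionStorage:
    """Append-only stand-in for VS; an address is an index into the list."""

    def __init__(self):
        self.images = []

    def append(self, image):
        self.images.append(image)
        return len(self.images) - 1

    def read(self, addr):
        return self.images[addr]


def write_version(vs, object_id, data, prev, page_size):
    """Write one version and return the VS address the object header should use."""
    if len(data) <= page_size:
        return vs.append({"type": SIMPLE, "object": object_id,
                          "prev": prev, "data": data})
    # Data images carry no identifying fields; they are written first because
    # their VS addresses must be known before the header can be created.
    pointers = [vs.append({"type": DATA, "data": data[i:i + page_size]})
                for i in range(0, len(data), page_size)]
    return vs.append({"type": STRUCTURED, "object": object_id,
                      "prev": prev, "data": pointers})


def read_version(vs, addr):
    image = vs.read(addr)
    if image["type"] == SIMPLE:
        return image["data"]
    if image["type"] == STRUCTURED:
        return b"".join(vs.read(p)["data"] for p in image["data"])
    raise ValueError("a data image is interpretable only through its header")
```

Copying a structured version would follow the same order as `write_version`: copy every data image, then append a fresh header holding the new pointer array.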

2.3  Mapping  VS  address  space  onto  physical  storage  devices 

To ensure that the version storage is stable, all VS images should be written twice, that is, the entire
VS should be duplicated. It can be assumed that two separately controlled physical devices provide
decay-independent sets from the point of view of physical failures of the driving hardware, e.g., head
crashes. As discussed earlier, the two write operations to duplicate VS can be performed
concurrently, thus the response time performance does not have to degrade significantly as a price
for stability.

Figure 8: Representation of large object versions.

The current version and the token are represented as structured version images. (Labels in the
figure: header of structured version image; Type: 2 = data image.)

In addition to ensuring stability of stored information, it is necessary to ensure that version images
are written correctly into VS. The usual approach is to follow each write by a read and a test
operation. If it is decided (after possibly several read and test attempts) that the write was incorrect,
the write operation must be repeated. However, if the physical device is write-once only, the
repeated write has to write the data to a new address! This may happen even with devices that
allow multiple writes to the same location, since some areas on a device may be faulty, and
consequently a write operation to such a location can never succeed. This problem can be handled
in two ways. One is to leave a "hole" in the VS address space. The other one is to mask the bad
write on the device level by writing into an alternative address in an area specifically reserved for
this purpose. In the first case, the correct VS address cannot be determined until after the write to
VS has succeeded. This means only that the token reference (or the current version reference, when
a copy operation is performed by the OVS manager) in the object header cannot be set until the VS
write terminates, but this order must be upheld anyway. However, the duplication of VS creates
an additional problem. The address of each of the two copies of each version image must be easily
computable from the VS address. Thus, for a duplicated write, if one write operation does not
succeed, the other one must be invalidated also. Thus, the same "hole" (bad data) has to be created
on both devices. This scheme, however, cannot support recovery from later decays. When it is
discovered that some old version was damaged on one device, then in order to restore the
redundancy for the future, it would be necessary to copy the entire device, but in this process,
different bad writes may occur, and the two copies of that part of VS would be out of sync! Note
that it is not possible to copy just the respective version image (from the other device), since then
the entire "newer history" of that object, that is, the portion of the object history between the
current version and the version represented by the defective version image, would have to be
recreated.

Thus, the chosen approach is to preserve the continuity of the VS address space. Each device must
have a reserved area that provides substitute locations for data that could not be written into its
correct address. There still may be "holes" on the device, but when such a hole is detected, the
reserved area is searched for the missing data. Thus both write and read operations on VS may
require several device accesses, but presumably the reserved area will be used only in rare cases, so
the performance penalty should be low. However, the fact that the device manager decides that a
write was unsuccessful does not guarantee that on a later read the same entity will be detected as
bad. Thus, the device manager should explicitly mark (overwrite) the areas declared to be holes,
in such a way that holes can be reliably detected in the future.
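
A toy model may help to show how the reserved area keeps the VS address space continuous. Everything in it is an assumption made for illustration: the `Device` class, the `HOLE` marker, the set of faulty locations that stands in for a failed read-and-test, and the representation of the reserved area as a list of (address, data) pairs.

```python
HOLE = object()   # explicit, reliably detectable hole marker


class Device:
    def __init__(self, size, reserved_slots, faulty=()):
        self.main = [None] * size
        self.reserved = []               # substitute locations: (address, data)
        self.reserved_slots = reserved_slots
        self.faulty = set(faulty)        # locations where a write never succeeds

    def _write_and_verify(self, addr, data):
        # Stand-in for "write, then read back and test".
        if addr in self.faulty:
            return False
        self.main[addr] = data
        return True

    def write(self, addr, data):
        if self._write_and_verify(addr, data):
            return
        # Mark the hole explicitly so that a later read cannot mistake it
        # for good data, then place the data in the reserved area.
        self.main[addr] = HOLE
        if len(self.reserved) >= self.reserved_slots:
            raise OSError("reserved area exhausted")
        self.reserved.append((addr, data))

    def read(self, addr):
        value = self.main[addr]
        if value is not HOLE:
            return value
        # A hole was detected: search the reserved area for the missing data.
        for a, data in self.reserved:
            if a == addr:
                return data
        raise OSError("data for hole not found in reserved area")
```

In this model the caller always uses the same address on both duplicated devices, so a bad write on one device does not force a matching hole on the other.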

Finally, it is necessary to address the problem of VS performance. The provision for maintaining
the current versions online is only the first step. The performance of the repository will depend
strongly on the performance of OVS, that is, on the speed of reading from and writing to OVS.
Since write operations are multiplexed with random read accesses, the low overhead of the
sequential write (append) operations on VS is lost. However, the repository is shared, and thus
there may be many outstanding read requests to different locations of the OVS device. The
performance of the device (throughput) can be improved significantly if these requests are processed
in an order that minimizes the positioning overhead. The most effective disk scheduling algorithm
is to scan the disk in alternating directions, servicing requests in the order of their physical
addresses. Several variants of the basic SCAN algorithm were developed and analyzed [COFF 71];
however, since the address distribution of requests in OVS is not completely random, it may be
possible to find a variant of SCAN that will perform better than these general algorithms. Also, a
possible enhancement of the SCAN scheduling algorithm for the OVS device is to force a write of
one of the VS buffers when the disk heads reach the current end of OVS (M_E).
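
For readers unfamiliar with SCAN, the basic ordering can be sketched as below. The function name `scan_order` and the integer model of physical addresses are illustrative assumptions; the sketch shows only the plain algorithm for one set of outstanding requests, not the OVS-specific variant or the forced buffer write suggested above.

```python
def scan_order(requests, head, direction=1):
    """Order outstanding requests as a basic SCAN sweep would service them.

    requests:  physical addresses of the outstanding requests
    head:      current head position
    direction: +1 if the heads are moving toward higher addresses, -1 otherwise
    """
    if direction > 0:
        ahead = sorted(r for r in requests if r >= head)
        behind = sorted((r for r in requests if r < head), reverse=True)
    else:
        ahead = sorted((r for r in requests if r <= head), reverse=True)
        behind = sorted(r for r in requests if r > head)
    # Finish the current sweep, then reverse and pick up the rest.
    return ahead + behind


print(scan_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
```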

In addition to finding a suitable algorithm for the OVS device management, performance of VS can
also be influenced by:

i. assigning physical addresses to VS addresses

ii. mapping VS access requests to physical devices.

One possibility is to interleave VS, that is, assign consecutive VS blocks to different physical devices.
This of course requires additional device drives. However, it is possible to take advantage of the
duplication of VS. If both devices in this duplicated implementation provide fast random read
access, a read request can be satisfied by either of the two devices and can be scheduled for that
device which is more convenient (i.e., not currently busy, or needs less time to locate the requested
version image).


3.  Management  of  OVS 


The Online Version Storage is very important to the performance of the repository. As presented in
Section 2.1, OVS is an online address space managed as a circular buffer that contains the most
recent 2^n words of VS. If no version images must be copied, removal of old version images is
accomplished by simply overwriting them as M_E, the end-of-VS mark, reaches that part of OVS.
However, if a version image must be copied to maintain the current version of the respective object
in OVS, a rather unpleasant situation may arise: in order to write a version image for a new
version, the OVS manager must copy one or more version images that lie ahead of M_E to make
enough space for this new version image. However, in order to make space for the copied version
images, more space has to be freed. Such a "chain reaction" can be prevented if the OVS manager
looks ahead at which version images may have to be copied and performs the copying before that
part of OVS space must be overwritten. (On the other hand, if the copying is postponed, it may not
be necessary to copy an old version image of a current version physically, since it is approximately
in the right place with respect to M_E, but some storage may have to be wasted in return.) M_C will
be used to mark the copy point in OVS.

M_C specifies how far the OVS manager has cleared OVS for immediate reuse, that is, no
version images need be copied before that part of OVS can be reused. (M_E - M_C) mod 2^n is then
the amount of the immediately reusable space.

The main problem in managing OVS is how to determine when a version image must be copied. It
is clearly wasteful to examine every single version image in OVS as the copy mark moves; most
version images should not have to be copied, since the respective objects already will have newer
versions. (If this assumption does not hold, then this whole approach is wrong.) Since the
information whether a version image represents the current version or the token of an object is
embedded only in the object header, the decision process concerning what and when to copy should
start at the object headers.

In order to maintain the current versions of all objects in OVS, the objects should be ordered
according to the time when their current versions were last written into VS. This approach is
investigated in Section 3.1. In Section 3.2 the requirement that each object must have at least one
version in OVS is relaxed; this leads to a much simpler implementation. In Section 3.3,
management of OVS is reexamined and adjusted to an implementation with write-once storage
devices. Section 3.4 looks at the implementation of OVS from the point of view of the number of
device drives needed.

3.1  Current  versions  of  all  objects  maintained  in  OVS 

The general moving window scheme outlined earlier can be restated as follows. When more OVS
has to be cleared for reuse, the OVS manager will search for the object that has not had a new
version image written into OVS for the longest time. The current version of this object will be
referred to as the oldest current version in OVS. Let us call it X. (Note that this is not the
oldest current version in the repository, that is, a current version with the lowest creation time,
since that one may have been copied more recently.) Let A_C be the VS address of M_C. Then
A_X >= A_C, where A_X is the VS address of X, since by definition the portion of OVS "older" than
the position of M_C has already been cleared. All version images older than X, that is, with
addresses A_vi < A_X, can be deleted; this means that M_C can be moved to A_X (Figure 9).
However, if A_C = A_X, it is necessary to copy X to the "newer" portion of OVS.


The first problem is how to find X. First let us assume that all objects in the repository are
ordered according to the time the last version image of their current version was written into OVS,
that is, according to the VS address of the last image of their current versions. The OVS manager
will maintain a sorted list of objects; let it be called COPYLIST. (COPYLIST in fact would contain
just pointers to the object headers.) The object with the oldest current version in OVS is on the top
of the list. When a new version image for some object is written into OVS, the object should move
to the bottom of COPYLIST. Unfortunately, the new version image may, and in most cases will,
represent a token. Since a token may be later aborted, it is not appropriate to move the object to
the bottom of the COPYLIST at the time the version image for the token is created. Now, assume
that an object has a token, and its current version will become subject to being overwritten if M_C is
moved. The current version must be copied, again because the token may be later aborted. But
what should be the relative position of the object in the COPYLIST after the current version has
been copied? Since the version image of the token precedes the new version image of the current
version, the position of the object in the COPYLIST is determined by the token. If the token is
later committed, nothing need be done. If the token is aborted, the object must be moved to a
position in COPYLIST that corresponds to the location of the current version in OVS. If the
current version has not been copied since the creation of the token, no action is necessary.
Finally, if the fate of the token is still undecided when M_C reaches the respective version image,
the token must be copied, or, more precisely, the representing version image must be copied. That
is, the OVS manager must also look for the oldest token in OVS, as it clears OVS.


To summarize, an object is eligible to move in the COPYLIST only when:

1.  its  current  version  is  copied  or 

2.  its  token  is  committed  or 

3.  its  token  is  aborted  or 

4.  its  token  is  copied. 
Figure 9: Release of OVS occupied by old versions.

(Legend: X marks the oldest current version in OVS; hatching marks cleared storage; dashed lines
are pointers to offline VS.)

Since objects 1345, 2500 and 2483 already have newer versions, the portion of OVS between M_C
and X can be released without having to copy any version image. Note that the version images
in the cleared storage are still accessible.
Let A_cv and A_t be the current version and the token reference contained in the object header.
Then Table 1 shows under what conditions the object does move in the COPYLIST. (A graphical
illustration for a simpler kind of COPYLIST will be found in Section 3.1, in Figure 11.) If a nil
reference (no token exists) is represented by a negative number, then to test for the existence of a
token when the current version is copied, it is sufficient to test if A_cv < A_t. Thus, for any of the
four kinds of events, the resulting position of the object in the COPYLIST is always determined by
the greater of A_cv and A_t prior to that event.

Table 1: Management of COPYLIST

event:                       condition:                result:
object is eligible to        object is moved           position of the object in
move in COPYLIST             in COPYLIST               COPYLIST determined by

current version is copied    object has a token        A_t
token is committed           A_cv ≤ A_t                A_t
token is aborted             A_t ≤ A_cv                A_cv
token is copied              A_t < A_cv                A_cv
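The positioning rule stated above (the greater of the two references prior to the event) can be
sketched as follows. This is only an illustration under stated assumptions: VS addresses are
non-negative integers, a nil token reference is encoded as -1, and the function names are
hypothetical, not part of the repository design.

```python
NIL = -1  # assumption: a nil token reference is a negative number, as the text suggests


def moves_when_cv_copied(a_cv: int, a_t: int) -> bool:
    """Test applied when the current version is copied.

    With nil encoded as a negative number, A_cv < A_t can only hold
    if a token exists, so no separate existence check is needed.
    """
    return a_cv < a_t


def copylist_position(a_cv: int, a_t: int = NIL) -> int:
    """Address that determines the object's position in the COPYLIST.

    Both arguments are the references as they stood prior to the event.
    """
    return max(a_cv, a_t)
```

For an object with A_cv = 100 and A_t = 250, the resulting position is determined by 250; an object
without a token is positioned by its current version alone.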

The overhead of clearing OVS for reuse should be distributed over time. The OVS manager can be
implemented as a demon process that runs concurrently with the processes that create and commit
tokens. To maintain the amount of cleared OVS within specified limits, the demon is run when
<M_E, M_C> drops below the lower limit, and it goes to sleep when it has cleared enough space as
determined by the upper limit. A large amount of OVS may be cleared in just one step, by
jumping to the oldest current version or token in OVS. Thus it is quite possible that the amount of
cleared space far exceeds the upper limit: many new version images may be created before it is
necessary to run the demon again. The demon should not copy the oldest current version or token
unless more clear space is necessary. If the demon stops at such a version, it may be that the next
time it is run, the respective object will by then have a newer version, and thus no copying is
needed. On the other hand, the demon may run into a situation when it must copy almost every
version; this, of course, will not free any space. If this is just a local phenomenon, that is, the
images of the current versions of some objects became clustered, the demon will eventually release
enough space (unless none of these objects is ever updated again). Otherwise, it might be an
indication that the system is saturated.
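The wake-up and sleep rule for the demon amounts to a hysteresis between the two limits. The sketch
below is a minimal illustration, assuming the amount of cleared space is available as an integer;
the names `demon_should_run` and `simulate` are hypothetical.

```python
def demon_should_run(cleared: int, lower: int, upper: int, running: bool) -> bool:
    """Decide whether the clearing demon should be active."""
    if cleared < lower:
        return True       # too little cleared space: wake the demon
    if cleared >= upper:
        return False      # enough space has been cleared: go to sleep
    return running        # between the limits: keep the current state


def simulate(cleared_samples, lower, upper):
    """Replay a sequence of cleared-space readings and record the demon state."""
    running = False
    states = []
    for cleared in cleared_samples:
        running = demon_should_run(cleared, lower, upper, running)
        states.append(running)
    return states
```

Because the state between the limits is sticky, the demon does not oscillate when the cleared space
hovers around a single threshold.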


This scheme could be finely tuned to operate with a very small amount of cleared storage. This in
turn means that multiple copies of a version or a token exist in OVS for only a very brief time
interval; thus it is possible to achieve very good OVS utilization, in terms of the useful information
stored. However, even if the entire COPYLIST could be kept in primary memory, the overhead
of re-sorting the COPYLIST may be significant. This problem can be eliminated if a different
policy for keeping current versions in OVS is adopted, as discussed in the following sections.

3.2  Most  recently  used  current  versions  maintained  in  OVS 

In the schemes described in the preceding section, the OVS manager must maintain at least the
current version of every object in OVS. This means that if T is the average time it takes to cycle
through OVS, then the current version of an object that has not been updated for nT will be
copied n times. This represents a performance penalty that may be unnecessary, since some objects
will not even be read for long periods of time, yet the OVS manager will keep copying them in
OVS. To give a more specific example, in a reasonably busy repository, a 300 Mbyte disk used as
OVS may fill up in less than a day. It is highly likely that many objects in the repository will be
dormant for many days, weeks, or even months; copying them every day would be quite wasteful.
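The cost of this policy can be made concrete with a small calculation. The function below is a
sketch only; its name and the choice of days as the time unit are assumptions made for the example.

```python
def copies_while_dormant(dormant_time: float, cycle_time: float) -> int:
    """Number of times a current version is recopied in OVS.

    An object that has not been updated for n*T, where T is the time to
    cycle through OVS, is copied n times.
    """
    if cycle_time <= 0:
        raise ValueError("cycle_time must be positive")
    return int(dormant_time // cycle_time)


# Example: OVS cycles once a day; the object is dormant for 30 days.
print(copies_while_dormant(30, 1))  # 30
```

With OVS cycling in less than a day, an object that stays dormant for a month is recopied at least
thirty times without ever being read.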

The OVS management policy will be relaxed such that only those objects that had their current
version actually accessed (read, or had a new version created) since t_c - T will be kept in OVS,
where T is again the time it takes to fill up OVS. With this relaxation, copying of dormant objects
is avoided. In addition, the copying process can be simplified. In particular:

i. it is not necessary to sort objects to keep track of which objects must have their current
versions or tokens copied as the OVS manager works on clearing OVS; the current versions
and tokens can be copied as they are accessed,

ii. no special demon process is necessary to clear OVS; clearing of OVS is automatically
distributed over time.

Let M_C specify again the copy point. If M_C = M_E (i.e., A_E - A_C = 2^n), an object will maintain
its current version in OVS only if a new version is created at least every T time units. If
A_E - A_C < 2^n, the version images of the current versions and tokens that are in this portion of
OVS will be copied in OVS when read. The bigger the distance between M_C and M_E, the less
frequently must the current versions be read to remain in OVS. An additional optimization is
possible: if the version image to be copied is close to M_E (mod 2^n), then if one is willing to
sacrifice the intervening storage, such a version image does not have to be copied, since the
storage between the version image and M_E is in a sense already "cleared."

A version (token) reference is resolved as before: if A_E - A_VI < 2^n, the representing version
image is in OVS. In addition, when a version image of a current version or a token is read, then if
A_VI < A_C a copy of this version image will be created in OVS. To improve the chances that the
current version is in OVS at the time a token is committed, that version image should also be
copied, if its address is lower than A_C. If a current version is not represented in OVS, the
appropriate version image is retrieved from the offline VS and written at the current end of OVS.
Thus current versions of objects that have not been read for a long time can be reinstalled in OVS
with this simple mechanism. Finally, it would be possible to provide a simple "refresh" process for
those objects that should always stay online. This process would periodically read such objects to
force their copying in OVS.
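The resolution and copy-on-read rules of this section can be sketched as below. This is an
illustration, not the repository's implementation: addresses are modelled as monotonically growing
integers, the symbols A_E (current end of VS), A_C (copy point) and A_VI (address of a version
image) follow the reading of the text adopted here, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MovingWindowOVS:
    n: int            # OVS holds the upper 2**n words of VS
    a_e: int = 0      # A_E: VS address of the current end of VS
    window: int = 0   # A_E - A_C: distance kept between the end and the copy point

    def in_ovs(self, a_vi: int) -> bool:
        """A version image is online if A_E - A_VI < 2**n."""
        return self.a_e - a_vi < 2 ** self.n

    def append(self, size: int) -> int:
        """Write an image of `size` words at the current end; return its address."""
        addr = self.a_e
        self.a_e += size
        return addr

    def read_current(self, a_vi: int, size: int) -> int:
        """Read a current version (or token) and return its address afterwards.

        The image is rewritten at the end of OVS if it is offline or if it
        lies below the copy point (A_VI < A_C).
        """
        a_c = self.a_e - self.window
        if not self.in_ovs(a_vi) or a_vi < a_c:
            return self.append(size)
        return a_vi
```

Setting `window` to 2**n corresponds to M_C = M_E, in which case reading never causes a copy and
only the creation of new versions keeps an object online.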

3.3  Adapting  OVS  management  to  an  implementation  with  write-once  devices

The two schemes presented in the preceding sections assumed that OVS is implemented with
reusable physical storage, that is, that new and copied version images simply overwrite those with
addresses lower than A_E - 2^n. This means, however, that the overwritten images must be preserved
at some other device that is a part of the permanent VS. Alternatively, the storage devices used in
OVS can be the actual VS. When a device is filled up, it is removed and stored offline, and a fresh
device replaces it. Since the devices are written only once, VS can be implemented entirely with
optical disks. Unfortunately, the fine tuning, which is the major attraction of the schemes presented
so far, cannot be achieved when OVS is implemented in this way, since the OVS space can be
"reused" only by replacing an entire device. Rather, OVS should be viewed as being divided into
fixed-sized partitions, where each partition corresponds to one physical device.

To implement the same policy as the one used in Section 3.1, when the current versions of all
objects are to be kept in OVS, it is necessary to have the minimum of three partitions. These
partitions, called here LOW space, MIDDLE space, and HIGH space, do not have to be of equal
size, but for simplicity, let us assume that they are. Again, OVS will be managed as a circular
buffer (Figure 10). When the MIDDLE space becomes full, all the version images in the LOW
space will be purged and the spaces will be reassigned such that:


    MIDDLE -> LOW
    HIGH   -> MIDDLE
    LOW    -> HIGH

Figure 10: Management of OVS: tripartite scheme. a) The middle space is full, it is necessary
to clear the LOW space. b) The current version of object 3221 was copied, and the spaces were
reassigned. (The diagram shows the object headers with their CR references, the LOW, MIDDLE,
and HIGH spaces, cleared storage, and pointers to offline storage.)

M_E marks again the current end of VS in OVS; M_E falls into either the MIDDLE or HIGH space.
M_C points always to the beginning of the LOW space; it moves only when the spaces are
reassigned. To ensure that each existing object will retain an image of the current version in OVS,
it is necessary to find all objects that have their current versions in the LOW space. Copies of these
versions will be created in the HIGH space, which is free at the beginning of the purge of the LOW
space.


This scheme reduces the sorting problem into a tripartite sort. An object is logically mapped into
the space which is the older of: the space that contains the last version image that represents the
current version, and the space that contains the last version image that represents the token, if any.
The conditions under which an object moves into a higher space are similar to those for the
previous scheme. Let S_cv and S_t be the OVS spaces that correspond to the addresses A_cv and A_t at
a given moment. An object is then mapped as specified by Table 2, where the ordering on the
spaces is LOW < MIDDLE < HIGH. The possible changes in the logical mapping of an object into
the three spaces are illustrated in Figure 11.


Table 2: Mapping to OVS spaces

event:                       condition:                          result:
object is eligible to move   object is moved to a                mapping of the object in
to a higher space            higher space                        OVS spaces determined by

current version is copied    object has a token and S_cv < S_t   S_t
token is committed           S_cv < S_t                          S_t
token is aborted             S_t < S_cv                          S_cv
token is copied              S_t < S_cv                          S_cv
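The tripartite mapping and the reassignment of spaces can be sketched as follows. The enumeration
and function names are hypothetical, and the rotation direction (the purged LOW space is reused as
the new HIGH space, with the other two spaces each moving down) is the reading of the text assumed
here.

```python
from enum import IntEnum
from typing import Optional


class Space(IntEnum):
    LOW = 0
    MIDDLE = 1
    HIGH = 2


def mapped_space(s_cv: Space, s_t: Optional[Space] = None) -> Space:
    """Logical space of an object: the older of S_cv and S_t (if a token exists)."""
    return s_cv if s_t is None else min(s_cv, s_t)


def reassign(space: Space) -> Space:
    """Role of a space after a purge: MIDDLE -> LOW, HIGH -> MIDDLE, LOW -> HIGH."""
    return Space((space - 1) % 3)
```

Only three values need to be compared, so no sorted list of objects has to be maintained; an object
whose mapped space is LOW at the time of a purge is the one whose image must be copied.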

This OVS management scheme is not limited to an implementation with write-once devices. It is
possible to take advantage of the simplified ordering on objects required by this scheme even if the
physical OVS device is reusable.

If OVS is implemented with write-once devices, then although the physical storage capacity of OVS
is 2^n words, OVS does not contain the most recent 2^n words of VS as before. This is because when
the LOW space is reassigned as the HIGH space, the physical device for this part of the OVS must
be replaced with a fresh one and thus the corresponding OVS address space does not contain valid
version images. In fact, on average, 50 percent of OVS will be empty. This has to be reflected in
the resolution of version references. Let us use another mark, M_L, to identify the oldest valid VS
image in OVS: M_L will point to the beginning of the LOW space. In this scheme, M_L is the same as
M_C, but it will be different for the other copy policy, as discussed below. A version image is in
OVS only if

    A_L ≤ A_VI

Figure 11: Resolving the token problem. Surviving panel captions:
a. Situation just prior to the beginning of a purge of the LOW space. Object 3221 has its current
version in the LOW space and a token in the MIDDLE space.
b. Situation just after the current version has been copied into the HIGH space.
e. Situation during the next purge. The former token was committed, but a new version was already
created (and committed). Thus the former token does not have to be copied.
f. Situation during the next purge. The token was aborted.
(The diagrams show the object header, cleared space, and pointers to offline VS.)

If the management policy is to maintain in OVS only those current versions that have been actually
used in the recent past, it is sufficient to divide OVS into two partitions, LOW space and HIGH
space. When the current version of an object is read, the address of that version image, A_cv, is
used to determine whether this image is in the LOW or the HIGH space. If it is in the LOW
space, it is copied into the HIGH space. New versions (tokens) are created always in the HIGH
space, that is, M_E maps always into the HIGH space. The copy mark M_C must point to the
beginning of the HIGH space and the mark M_L to the beginning of the LOW space. Again, if
A_VI ≥ A_L, the version image is in OVS. If a version image represents a current version, then if
A_VI < A_C the version image will be copied.
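For the bipartite scheme, the decision taken when a version image is referenced can be sketched as
a three-way classification. The function name and the string results are hypothetical; A_L and A_C
are taken to be the addresses of the beginning of the LOW and HIGH spaces respectively, as the
paragraph above is read here.

```python
def resolve_bipartite(a_vi: int, a_l: int, a_c: int, is_current: bool) -> str:
    """Classify a reference to the version image at address A_VI.

    a_l: address of the beginning of the LOW space (mark M_L)
    a_c: address of the beginning of the HIGH space (copy mark M_C)
    """
    if a_vi < a_l:
        return "offline"        # older than the oldest valid image in OVS
    if is_current and a_vi < a_c:
        return "copy-to-high"   # current version found in the LOW space
    return "online"             # readable in place, no copy needed
```

Old (non-current) versions found in the LOW space are read in place and are never copied; they
simply go offline with the device when the LOW space is replaced.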

These schemes resemble real-time copying garbage collection algorithms. However, in the context
of garbage collection, objects that are not copied into the HIGH space are irretrievably lost. Thus,
any object to which there exists a valid reference must be copied. This would mean copying the
entire histories of all objects in the repository; thus, although the bipartite (and tripartite) OVS
model and copying of version images were borrowed from the work on garbage collection, the
implementation details are significantly different. (A copying "garbage collector" for large paged
virtual memory that works in a similar way as the schemes presented here was recently proposed for
the LISP machine, but the details have not been worked out yet.)

3.4  Online  support  for  VS 

As already discussed in the previous section, the physical support of OVS may be reusable storage
devices that are maintained permanently online, or just "reusable" device drives, where the storage
devices are replaced with fresh ones as they become full. The latter approach has the advantage
that the entire VS can be implemented exclusively with optical disks. To implement the schemes
presented in Section 3.3, one device drive is needed for each OVS space. When the LOW space is
filled up, the device that contains the LOW space is replaced with a fresh device, and the replaced
device becomes part of the offline version storage. In particular, if the policy that only those
current versions and tokens actually accessed are to be maintained in OVS is adopted, two drives
are needed; an implementation of OVS that uses this management scheme will be examined in
more detail.

As said earlier, the entire VS should be duplicated for stability. However, since version images are
created only in one space at any time, only one additional device drive is necessary, to duplicate this
space. This duplicate is removed when that space is filled up, and replaced with a fresh device that
is assigned to the next space.

Finally, if it is necessary to read a version that is not available in OVS, the respective device has to
be found and brought online. This requires yet another drive. Figure 12 illustrates the
implementation with the minimum number of device drives.

To avoid long delays due to the manual replacement of the storage devices, it is necessary to add
one more drive. Two drives are used for the LOW and HIGH spaces as before, and two drives are
assigned to VS backup, but the actual assignment of the drives changes as illustrated in Figure 13.
Each OVS space is divided into two equal parts, and each part is mapped into a different backup
device. When the HIGH space is filled halfway, the backup device is full and the backup is
redirected to the other backup device. The full backup device is replaced with a fresh device, and
once the HIGH space is full, this device will become the new HIGH space; thus the drive is
reassigned from the backup function to the "current VS" status. Basically, at any time, the
assignment of the drives is:

    current VS:  LOW space
                 HIGH space

    backup:      low part of HIGH space
                 high part of HIGH space

    when the HIGH space fills up, i ← i+1
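The split of the HIGH space between the two backup devices can be illustrated with a small helper.
This is a sketch under assumptions: offsets are counted in words from the beginning of the HIGH
space, the two backup devices are numbered 0 and 1, and the function name is hypothetical.

```python
def backup_device_for(offset: int, space_size: int) -> int:
    """Select the backup device for a write at `offset` within the HIGH space.

    Returns 0 for the device that mirrors the low part of the HIGH space
    and 1 for the device that mirrors the high part.
    """
    if not 0 <= offset < space_size:
        raise ValueError("offset lies outside the HIGH space")
    return 0 if offset < space_size // 2 else 1
```

The switch from device 0 to device 1 happens exactly when the HIGH space is filled halfway, which
is the moment the first backup device can be dismounted and replaced.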

The same scheme can be implemented with a reusable device such as a conventional magnetic disk
in the following way. Both partitions, the LOW and the HIGH space, can be mapped to the same
device. As the spaces are switched, the LOW space is simply overwritten. Of course, it is necessary
to ensure that the version images that will be overwritten will not be lost from VS. If we assume
that all images are written twice for stability, the second copy could be made in nonreusable storage,
thus guaranteeing that when the OVS device is reused, there does exist another copy of each
overwritten version image in VS. However, this does not ensure future stability, since once a
version image is overwritten in OVS, only one copy will continue to exist. Thus if it is required
that the copies of all images are maintained in VS, then either every image must be written three
times when it is created, or a copy of the LOW space must be made in nonreusable storage before
the LOW space is reused. The latter looks like a better solution. In particular, as a fresh HIGH
space begins to fill up, the LOW space can be copied onto another device (Figure 14).

The minimum number of device drives needed is the same as in the implementation that uses
optical disks only. Although OVS can be put now on a single device, two devices are needed for
backup. Finally, as before, an additional drive is needed to bring selected pieces of VS online when
a reference to an old version that is not in OVS is made.

Figure 14: Implementation of OVS/VS with a reusable OVS device. (In the figure, A_b is the VS
address of the first version on the "current VS" device.)


The need to replace the backup device for the HIGH space creates again the problem of long
delays. However, this problem can be resolved without an additional drive. If a "dump" of the
LOW space to the backup device can be finished sufficiently fast, the backup device can be
removed before the HIGH space fills up, and replaced with a fresh device which will become the
next "current VS" device. When the "current VS" device is filled, the VS manager switches to the
other drive which already has a fresh device mounted. Now a fresh backup device needs to be
mounted on the other drive; it should be possible to perform this operation and dump the current
LOW space before the HIGH space fills up again. Figure 15 illustrates the management of the
device drives where the VS devices are twice the size of the reusable OVS device. (To start this
duplicated VS system, the first backup device will be partially empty, corresponding to the first
dump of the LOW space, which is initially empty.)

Although it is possible to save one device drive compared to the implementation that uses only
nonreusable devices, the performance penalties for an interleaving of the normal operation of OVS
with the dump of the LOW space could be severe. The only real advantage of using a reusable
device for OVS is that it is possible to apply the more flexible moving window management
scheme.

Figure 15: Implementation of OVS/VS with a reusable device; management of device drives.
a) Normal operation, dump of LOW completed.
b) Spaces switched, dump of LOW completed. Backup device is full and can be removed.
c) Fresh device mounted on the drive D1; momentarily, there is no backup device.
d) Current VS device is full, the VS manager switched to the other drive; no backup device yet.


4.  Management  of  objects 


An object in the DDSS repository is an abstract type. The operations allowed on objects are:

create (pseudo-time, commit-record-id)

read (object-id, pseudo-time, commit-record-id)

create-token (object-id, pseudo-time, commit-record-id)

commit-token (object-id, commit-record-id)

abort-token (object-id, commit-record-id)

delete (object-id, pseudo-time, commit-record-id)

These operations are necessary to support the model described in Section 1. All of these operations
are performed as part of some atomic action. A token can be read only by the atomic action that
created it. Similarly, until the creation of an object is committed, only the atomic action that
created the object should be allowed to create a token for that object. The commit record reference
field in the object header can be used also for this purpose. When an object is created, this field
will contain a reference to the commit record of the possibility for the creation; if a token is created
later under the same possibility, the reference does not change. When the possibility is committed,
this reference will be set to nil, regardless of whether the object has a token. Then a token can be
created only if the commit record reference in the object header is either nil or is the same as the
commit record reference specified in the create-token request, and the object does not already have
a version for the specified pseudo-time.
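The admission test for create-token can be sketched as a single predicate. The representation is
an assumption made for the example: commit record references are strings (None stands for nil),
and the existing versions are modelled as a set of pseudo-times, which simplifies the way versions
actually cover ranges of pseudo-time. The function name is hypothetical.

```python
from typing import Optional, Set


def may_create_token(header_cr_ref: Optional[str],
                     request_cr_ref: str,
                     version_pseudo_times: Set[int],
                     pseudo_time: int) -> bool:
    """Return True if a create-token request may proceed.

    header_cr_ref: commit record reference in the object header (None = nil)
    request_cr_ref: commit record reference given in the create-token request
    """
    cr_ok = header_cr_ref is None or header_cr_ref == request_cr_ref
    return cr_ok and pseudo_time not in version_pseudo_times
```

While the creation of an object is still uncommitted, the header reference is non-nil, so only
requests made under the creating possibility pass the first test.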

In addition to the external operations listed above, operations copy-cv (copy current version) and
copy-token are needed for OVS management, but these are only internal operations, available solely
to the object manager. Both the external and the internal operations must start at the object header.

Objects in the repository have identifiers that are unique both in space and time; all requests to
perform operations on existing objects must include the uid of the desired object. The repository
must map the object uid into a physical address of the object header. The most straightforward way
is to have an object directory; this issue will be discussed in Section 4.3.

Since the object headers play such an important role, they should be stored in stable storage.
However, the object header is updated twice for each update of the object (create-token and
commit/abort token), and may be updated when the current version or token is read (extend the
end time). Finally, the object header is updated when the version image of the current version is
copied. The additional disk write for each update would represent a large overhead. Further, object
headers should be updated in place, otherwise it would be also necessary to change the map that
associates the object uid with the object header address. Thus read-write atomic stable storage
would be needed, which is more difficult and expensive to implement than the append-only atomic
stable storage used for VS. In particular, the two writes must be done sequentially. Thus the
decision is not to reflect all changes in the object header in stable storage; Section 4.1 discusses how
the object headers will be stored. Finally, Section 4.2 looks at the problem of synchronizing
concurrent accesses to objects on the level of object representation.

4.1  Object  headers 

The object headers are stored on a nonvolatile storage device that allows unlimited writes (e.g.,
magnetic disk). This device provides Online Header Storage, or OHS. Object headers are brought
into main memory as needed, and the changes made to an object header do not have to be
propagated into the copy in OHS until the main memory used by the object header is to be
reassigned. Since the current object headers might not be in stable storage at the time of a
processor crash or a device crash, they must be reconstructable from the information that is in stable
storage, in particular, the information contained in the version images. Consequently, the object
headers themselves become hints: they are not necessary to guarantee correct operation, but of
course are very important for good performance.

The object header as presented earlier does not contain all control information that must be
associated with an object. In particular, for accountability and protection, it is necessary to associate
with each object the owner's id and access control specification. The access control information has
to be checked for every remote request. It should be as easy to reach as the information contained
in the object header; the simplest strategy is to include it in the object header. However, this
additional information must be maintained in stable storage. The approach used so far, that is,
inclusion of all such information in version images, is rejected for two reasons. First, it represents
additional (and possibly substantial) storage overhead. Second, it is illogical to keep write permit
information in read-only versions. To make it stable without having to maintain the entire object
header in stable storage, the following strategy is proposed.

The object headers are maintained in OHS, but OHS is not stable (i.e., it is not duplicated). (In the
terminology of Lampson and Sturgis, OHS is careful storage.) The object header consists of two
parts, stable information and a hint, as shown in Figure 16a. When an object is created, and every
time the stable information changes, the object header is created (updated) in OHS after a new
image of the entire object header (that is, including the hint information) is written into VS.
Finally, the object header should be written into VS when an object is deleted. The information that
the object has been deleted has to be included in the object header; the access control specification
field could be used also for this purpose. Thus in addition to guaranteeing that this information will
not be lost, the repository keeps a complete history of the changes of the access rights, which may
be useful for auditing purposes.

Figure 16: Object header and its image in VS. b) Version image of an object header. c) VS image
representing an object header.

To create an image of an object header, the object header is simply treated as data, and the same
fields (envelope) are added as for version images (Figure 16b). The CR reference in an object
header image refers to the commit record of the possibility under which the object was created,
deleted, or the stable information changed. Thus treating object headers in this way solves not only
the stability problem but extends the mechanism for committing tokens to the rest of the operations
that modify the state of an object. In addition, the object may have a token, which has its own
commit record.

The object header images in VS have to be distinguishable from the version images and data
images: it must be possible to determine from the stored image itself that the data field represents
an object header. Thus object header images represent yet another tagged type of entity that can be
stored in VS, as shown in Figure 16c.

The hint information is guaranteed to be current only in the main memory. Once in a while, it is
written into OHS, and it is also possible to create periodically new images of object headers in VS
as checkpoints. Note that the images of the object headers will not be continuously copied in VS,
since in the normal situation the object headers will be read from OHS: the VS images will be used
only during recovery.

4.2  Synchronization 

The repository must be able to handle several requests concurrently, since most requests will require
one or more disk accesses. Also, the demon process of the OVS manager runs concurrently with the
processes that execute the requests. In some cases it is also possible to process concurrently several
requests that pertain to the same object.

All accesses to individual object histories have to be negotiated at the object header: a single lock or
monitor is needed per object. The most natural place for the lock is the object header. However,
the locks must be "soft", that is, must be automatically released by a crash; otherwise, if the
operation that set the lock was aborted by a crash, the object could remain locked out. This can be
achieved by allowing locks to be set only on the copies of the object headers in main memory, but
this approach has a serious shortcoming. For easier memory management, objects should be packed
in pages. Since it is not possible to "expand" the object header to add the lock when the object
header is mapped into main memory, the object header must have a permanent "lock field". If
locked object headers are not allowed to appear in OHS, the pages of object headers in main
memory have to be handled carefully: they must not be "write through", and they cannot be
automatically paged out by the virtual memory manager, at least not while some of the object
headers on the page have their locks set. Further, it is not possible to write a modified object header
into OHS while some other object header on the same page is locked. Alternatively, the locks for
each page of OHS could be kept in a special data structure (a bit vector) in main memory. Since
all object headers are of the same size, finding the appropriate lock given the OHS address of an
object header is not difficult.
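
The bit-vector alternative can be sketched as follows. This is only an illustration under stated assumptions: the header and page sizes and the names HEADER_SIZE, PAGE_SIZE and PageLocks are hypothetical, and the sketch is single-threaded. Because the vector exists only in main memory, a crash discards it and thereby releases every lock.

```python
# Hypothetical sizes; the text states only that all object headers have the same size.
HEADER_SIZE = 32      # bytes per object header (assumed)
PAGE_SIZE = 1024      # bytes per OHS page (assumed)


class PageLocks:
    """Soft locks for object headers, one bit vector per OHS page, kept in main memory."""

    def __init__(self):
        self._bits = {}   # page number -> int used as a bit vector

    @staticmethod
    def _locate(ohs_address):
        # Fixed-size headers turn the lookup into two divisions.
        page, offset = divmod(ohs_address, PAGE_SIZE)
        return page, offset // HEADER_SIZE

    def try_lock(self, ohs_address):
        page, slot = self._locate(ohs_address)
        mask = 1 << slot
        bits = self._bits.get(page, 0)
        if bits & mask:
            return False              # header is already locked
        self._bits[page] = bits | mask
        return True

    def unlock(self, ohs_address):
        page, slot = self._locate(ohs_address)
        self._bits[page] = self._bits.get(page, 0) & ~(1 << slot)

    def page_is_locked(self, page):
        # True while any header on the page has its lock set.
        return self._bits.get(page, 0) != 0
```

Since the lock bits never appear in the header pages themselves, those pages can be written to OHS without special handling.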

The  "automatic  release  of  locks"  after  a  crash  can  be  accomplished  in  yet  another  way:  the 
recovery process can simply ignore the locks set on the surviving object headers in OHS, and clear
the locks as part of reconstructing the object headers. This assumes that no normal processing is
allowed on any object until the object header has been inspected by the recovery process; actually,
as will be seen in Section 6, a read request that refers to a portion of the object history that is
accessible from the surviving object header can be allowed to proceed, in spite of the object header
being still locked from the epoch before the last crash.

The  simplest  locking  policy  is  to  lock  the  object  header  for  the  duration  of  each  of  the  operations 
listed  in  the  beginning  of  Section  4,  but  for  maximum  concurrency,  object  headers  should  be  locked 
only  for  the  shortest  possible  time.  This  corresponds  to  operations  on  the  object  representation  that 
must be atomic. (Locking guarantees only indivisibility in the absence of failures; recoverability is
provided by the underlying VS system.) The individual operations on objects must lock the object
header as follows:


create:        locking is not necessary, since the object does not become known until the
               create operation terminates (returns object-id)

read:          find the appropriate version; if it is still a token, test if it can be read;
               change the end time if needed

create-token:  i. test if this token can be created; if yes, modify the object header to indicate
               that the object now has a token (note: the VS address of the token is not
               yet known)

               ii. set the token reference after the version image of the token has been
               written into VS

commit-token:  i. change the current version reference and the current version end time

               ii. clear the token reference and the related fields (the token end time and the
               commit record reference)

abort-token:   clear the token reference and the related fields

copy-vs:       change the current version reference

copy-token:    change the token reference


The copying of version images, however, could cause problems when interleaved with execution of
the external operations, in particular, commit-token or abort-token:

i. If commit-token is executed while the current version is being copied, the OVS
demon could change the current version reference after it has been changed to point to the
new committed token.

ii. If commit-token or abort-token is executed while the token is being copied, the OVS
demon could change the token reference after it has been cleared to indicate that the object
no longer has a token.

The latter is a lesser problem (on the first attempt to read such a copied token, it would be
discovered that the token was committed and the object header would be properly reset), but it is
still annoying. An additional problem arises if the "copy when read" policy is adopted (Section
3.2). When a version image of the current version or token is read and found to be past the copy
mark, the OVS manager will initiate a copy operation. Now, if the same image is read again between
the copy-mark test on its VS address and the completion of the copy operation, it would be copied
again! In case of such a read and copy it is particularly undesirable to lock the object until the copy
operation is completed, since the requested version image may have to be read from the offline VS.
To solve this problem, two flags should be added to the object header: cv.copy and token.copy, to
indicate that a copy operation on the current version or the token is in progress. Subsequent read
requests can then proceed, but if the flag is set, a positive outcome of the copy-mark test will not
start another copy operation.
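
The effect of the copy flag on the read path can be sketched as follows. This is a minimal sketch, not the repository's code: the names ObjectHeader, copy_mark and start_copy are assumptions, and an image is taken to be "past the copy mark" when its VS address is below the mark.

```python
from dataclasses import dataclass


@dataclass
class ObjectHeader:
    cv_address: int          # VS address of the current version image
    cv_copy: bool = False    # set while a copy of the current version is in progress


def read_current_version(header, copy_mark, start_copy):
    """Return the VS address to read and schedule at most one copy per image.

    `start_copy` is a hypothetical callback that hands the work to the OVS
    manager without blocking the reader.
    """
    if header.cv_address < copy_mark and not header.cv_copy:
        header.cv_copy = True    # in a real system, set while the header is locked
        start_copy(header)
    return header.cv_address


def finish_copy(header, new_address):
    # Called by the OVS demon once the copied image has been written into VS.
    header.cv_address = new_address
    header.cv_copy = False
```

The header is locked only for the brief test-and-set of the flag, not for the duration of the copy, which may involve a read from the offline VS.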

The copy flags are also useful in commit-token and abort-token operations: before changing the
current version reference or the token reference field, these operations should check the appropriate
copy flag. If the flag is set, the conflict can be resolved in two ways:

i. Wait until the copy operation completes.

ii. Abort the copy operation; that is, prevent the OVS demon from changing the current
version or token reference. (Note that the particular version image (copy) may have already been
written into VS, or will inevitably be written if it is already in some VS buffer when the copy
operation is aborted. However, writing it into VS will not do any harm, not even with respect to
object header reconstruction after a crash.)
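
The second resolution (aborting the copy) might look like the sketch below. It is an illustration under assumptions: the Header layout and function names are hypothetical, and a fuller version would tag each copy operation so that a stale completion cannot be confused with a later copy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Header:
    current_version: int
    token: Optional[int] = None
    cv_copy: bool = False       # copy of the current version in progress
    token_copy: bool = False    # copy of the token in progress


def commit_token(header):
    """Cancel any copy in progress, then switch the references."""
    header.cv_copy = False      # the demon will notice this and drop its update
    header.token_copy = False
    header.current_version = header.token
    header.token = None


def demon_finish_cv_copy(header, copied_address):
    """OVS demon step: install the copied image only if the copy was not cancelled."""
    if not header.cv_copy:
        return False            # aborted; the image already written into VS is harmless
    header.current_version = copied_address
    header.cv_copy = False
    return True
```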

Note that the create-token operation does not have to be concerned with simultaneous copying. It
is impossible to copy the token before create-token terminates, and it does not matter whether the
token refers to the old or the new version image of the current version.

4.3  Object  directory 

The  object  directory  in  a  repository  serves  two  purposes:  it  locates  objects  actually  stored  in  the 
repository,  and  it  serves  as  a  forwarder  if  an  object  created  in  that  repository  is  moved  into  another 
repository. For local objects, the directory contains the OHS address of the object header. For
objects that were moved, it contains just the id of the new repository.

If  the  directory  of  some  repository  is  lost  or  damaged,  an  exhaustive  search  of  all  repositories  may 
have to be conducted to find an object known to have been created in that repository. Thus it is
desirable to keep the directory in stable storage. The simplest way to accomplish this is to represent
the directory as an object, with a reserved OHS address. Since the directory will be large, it will
have to be represented as a structured object.

The  OHS  addresses  do  not  have  to  change  during  the  objects'  lifetimes:  thus  the  directory  must  be 
changed  only  when  an  object  is  created,  moved  to  another  repository,  or  deleted.  Still,  even  with 
relatively  infrequent  changes,  creating  a  new  version  of  the  entire  directory  would  be  very 
expensive.  However,  it  should  be  possible  to  take  advantage  of  the  implementation  of  structured 
objects:  for  each  change  to  the  directory,  it  is  only  necessary  to  create  a  new  data  image  of  the 
affected  piece  and  a  new  structured  version  image  that  differs  from  the  previous  one  only  in  the 
reference to the modified piece. Since the size of individual pieces can change, the necessary
modifications can be kept fairly localized, even if the directory is represented as a sorted list or a
tree. If an entry is added and the size of the affected piece exceeds one page, it is simply split into
two  pieces. 
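
The piecewise update described above can be sketched as follows, assuming the directory is a sorted list of pieces. The names insert_entry and MAX_ENTRIES are hypothetical, and a small entry count stands in for "one page". Only the affected piece is rebuilt; every other piece is shared with the previous version.

```python
import bisect

MAX_ENTRIES = 4   # stand-in for the capacity of one page (assumed)


def insert_entry(version, uid, ohs_address):
    """Return a new structured version of the directory.

    `version` is a non-empty tuple of pieces; each piece is a non-empty,
    sorted tuple of (uid, ohs_address) pairs.
    """
    # Choose the piece whose first uid is the greatest one not exceeding `uid`.
    firsts = [piece[0][0] for piece in version]
    i = max(bisect.bisect_right(firsts, uid) - 1, 0)
    piece = sorted(version[i] + ((uid, ohs_address),))
    if len(piece) > MAX_ENTRIES:
        mid = len(piece) // 2                      # split an overfull piece in two
        new_pieces = (tuple(piece[:mid]), tuple(piece[mid:]))
    else:
        new_pieces = (tuple(piece),)
    # The new version differs from the old one only in the affected piece(s).
    return version[:i] + new_pieces + version[i + 1:]
```

The previous version is left untouched, which matches the append-only nature of VS.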

Requests received by the repository must contain the uid of the desired object. The OHS address
of the object header is obtained from the directory. To improve performance it is possible to return
to the brokers also the OHS addresses. These addresses can then be included in future requests, in
addition to the object uid. However, they are merely hints, that is, it is not guaranteed that the
particular object can still be found at that address when the request is received. Prior to accessing
an object, the object manager would have to check the validity of the hint by testing it against the
object uid in the object header. If the hint as received is invalid, a new hint can be sent back as
part of the response to that request. This kind of hint could be included also in the directory for
those objects that were moved into another directory.
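
The hint check can be sketched as follows. The data layout is an assumption made for illustration: `ohs` maps OHS addresses to header records carrying a "uid" field, and `directory` maps uids to OHS addresses.

```python
def locate_header(uid, hint, ohs, directory):
    """Return (header, fresh_hint); fresh_hint is None when the received hint was valid."""
    if hint is not None:
        header = ohs.get(hint)
        # The hint is trusted only if the header found there carries the requested uid.
        if header is not None and header["uid"] == uid:
            return header, None
    address = directory[uid]          # authoritative lookup in the directory
    return ohs[address], address      # the new hint goes back with the response
```

A stale or missing hint costs one directory lookup and never produces a wrong answer.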

5. Management of commit records


Repositories  must  implement  another  abstract  entity  --  the  commit  record.  A  commit  record 
includes the state of the possibility it represents, a timeout, and a list of tokens (references to
tokens) created under that possibility. Commit records are mutable entities: both the possibility
state and the list of tokens must be modifiable. While a commit record is still in an unknown state,
tokens  can  be  added  to  (and  possibly  deleted  from)  the  list  in  the  commit  record.  Once  the 
possibility  is  completed,  the  state  of  a  commit  record  is  set  to  committed  or  aborted  and  tokens  can 
only  be  removed  from  the  list. 
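
The rules for modifying a commit record can be summarized in a small sketch. The class and method names are hypothetical; only the three states and the add/remove rules come from the text.

```python
class CommitRecord:
    """Mutable commit record: a possibility state, a timeout, and a list of tokens."""

    def __init__(self, timeout):
        self.state = "unknown"
        self.timeout = timeout
        self.tokens = set()           # references to tokens

    def add_token(self, ref):
        # Tokens may be added only while the outcome is still unknown.
        if self.state != "unknown":
            raise ValueError("possibility already completed")
        self.tokens.add(ref)

    def remove_token(self, ref):
        # Removal is allowed in every state.
        self.tokens.discard(ref)

    def complete(self, outcome):
        if outcome not in ("committed", "aborted"):
            raise ValueError(outcome)
        if self.state != "unknown":
            raise ValueError("final state must not change")
        self.state = outcome
```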

The  list  of  tokens  associated  with  each  commit  record  is  only  an  optimization;  it  is  not  needed  to 
preserve consistency as required by the atomic action that created the possibility. Each token refers
to its commit record; thus whether or not a token can be converted into a version can be
determined by inquiring about the state of the commit record specified in this reference. This
process can be sped up with the help of the token list: when the possibility is committed or aborted,
all local tokens can be committed or aborted immediately. Another optimization is that it is
possible to delete the commit record once all of the tokens on the list have been processed. If the
token list cannot be guaranteed to include all tokens created under that possibility, then the commit
record must never be deleted, because there is no other mechanism to ensure that all tokens are
informed about the final state of the possibility.

In Reed's original model [REED 78], the commit record of a committed possibility is assumed to be
stored  in  atomic  stable  storage  until  all  tokens  on  the  list  have  been  reliably  changed  to  versions. 
Commit  records  of  uncommitted  possibilities  (aborted,  or  possibilities  the  state  of  which  is  still 
unknown)  do  not  have  to  be  kept  in  stable  storage:  if  the  commit  record  cannot  be  found,  the 
possibility  can  be  assumed  to  have  been  aborted.  Unfortunately,  when  the  recovery  of  the 
repositories  is  considered,  the  list  of  tokens  in  a  commit  record  is  not  sufficient  to  determine  when  a 
commit  record  can  be  deleted.  In  the  present  model,  the  conversion  of  tokens  into  versions  is  done 
merely by changing the references in the object header, and, as discussed in Section 4, the object
headers are not stable. As will be seen in Section 6, for recovery purposes, it is necessary to be
able  to  determine  the  state  of  a  possibility  for  a  long  time  after  all  the  tokens  have  been  converted 
into  versions.  This  means  that  committed  commit  records  must  never  completely  disappear  from  the 
repository;  Section  5.1  presents  a  scheme  that  accomplishes  this  by  representing  commit  records  as 
objects.  A  consequence  of  the  chosen  representation  is  that  the  token  lists  need  never  be  stored  in 
stable  storage.  The  fact  that  the  token  list  does  not  have  to  be  stable  simplifies  also  the 
implementation  of  distributed  possibilities  as  discussed  in  Section  5.2. 

5.1 Representing commit records as objects


For  stability,  commit  records  can  be  mapped  into  VS.  Since  nothing  ever  disappears  from  VS,  a 
commit  record  can  be  reconstructed  even  after  it  has  been  deleted  at  the  level  of  abstraction 
implemented  by  the  commit  record  manager.  Commit  records  could  be  represented  by  yet  another 
type  of  stable  entity  (similar  to  the  object  header  image),  or,  they  could  be  represented  as  objects. 
Implementing  commit  records  as  objects  has  the  advantage  that  all  externally  accessible  entities  in 
the  repository  can  be  located  and  access  to  them  controlled  by  the  same  mechanisms.  On  the  other 
hand,  the  object  abstraction  needs  to  be  extended  to  facilitate  implementation  of  commit  records,  as 
will  be  seen  later. 

There  are  several  possible  ways  to  implement  commit  records  as  objects.  The  following  approach 
was  chosen  because  it  utilizes  best  the  mechanisms  already  present  in  the  object  model.  When  the 
repository  receives  a  request  to  create  a  commit  record,  it  creates  an  object.  The  objects  and  tokens 
created  under  this  possibility  will  use,  as  their  commit  record  reference,  the  uid  of  this  object.  Since 
creation  of  objects  also  must  happen  under  some  possibility,  it  is  necessary  to  supply  a  commit 
record reference for the object that will represent a commit record. (Recall that this commit record
reference appears in both the OHS image and the VS image of the object header when an object is
created.) Creation of a commit record can be committed immediately. Thus a simple solution is to
set the commit
record  reference  for  a  commit  record  object  to  nil,  to  indicate  that  such  an  object  is  implicitly 
committed. 

Each stable image of a commit record contains the state of the possibility. The commit record
reference in the version image of an object representing a commit record is again nil. In this case,
however, a nil commit record reference does not mean that the version image is implicitly committed.
Rather, such a version image refers indirectly to itself: the actual state of the possibility, and
consequently of the representing version image, is embedded in the data field of that version image.
(It might be more suggestive to let the commit record reference in version images of a commit record
refer to the commit record itself, but it is easier to test for a nil reference than to detect such a
circular reference.)

As will be seen in Section 6, the list of tokens associated with a commit record does not have to be
stored in stable storage, since it is only a hint; it is not needed for recovery. If the repository
crashes, all objects will be recovered individually by locating their latest version images in VS. In
this process, the object manager will determine whether a version image represents a version or a
token by inspecting the appropriate commit record; this must be done even for those version images
that have earlier been determined to represent committed versions. Thus if the repository crashes
after a possibility was committed but before all of the tokens have been converted into versions, it is
not necessary to resume or restart the conversion process since it will be finished automatically as
part of recovery of the individual objects. The only reason for including the token list in a stable
image of a commit record is to aid in error detection: prior to converting a token, the token list can
be used to verify that this token is indeed part of that possibility.

The representation of a commit record is shown in Figure 17. In addition to creating an object as
the commit record representation, the create-commit-record operation creates also a token for that
object. Then the waiting for the outcome of a possibility can be accomplished through the already
existing mechanism: a process attempting to read the commit record object will find a token, and
consequently the read operation will be delayed until the token is either committed or aborted. To
commit a possibility, the commit record manager creates the last version image for the commit
record object that has the possibility state in the data field set to committed; this is a committed
version which also commits all the preceding tokens. Now, if the possibility is aborted, it should be
sufficient to abort the tokens of the commit record. For easier recovery from crashes, however, the
commit record manager should, after aborting the existing tokens of the commit record, create a
stable version with the possibility state set to aborted (Figure 17c). Finally, although deletion of an
object is merely a deletion of the object header, it is still important to be able to delete commit
records, since OHS is limited. With the chosen representation, commit records have to be explicitly
deleted even if a possibility is aborted internally, by a repository crash or because of a timeout.
The commit record manager should delete a commit record after it has processed the associated list
of tokens. Such a deletion is again implicitly committed. Thus the VS image of the object header
created by the delete operation will have the commit record reference set to nil. If the repository
crashes before the commit record could be deleted, the commit record object will be recovered; it
should be deleted as part of the recovery.

The present object model does not permit creation of another token and its commitment if the
object already has a token. Since a token of a commit record cannot be turned into a version with
the existing mechanisms, it is not possible to create the final version of a commit record as
described above. It would be possible to add another operation, create-version, that would ignore
the token, but a more general solution is to extend the object model such that it allows creation of
more than one token for the same object within the same possibility. As presented in the beginning
of Section 4, the object model already allows, within the same possibility, creation of a token for a
newly created and uncommitted object; the extension needed to support multiple tokens is very
simple. To create another token, a version image is created as for the first token, but the "previous
version" field in this version image must refer now to the preceding token (Figure 18). The token
reference in the object header is changed to point to the version image of the new token. The
commit record reference is unchanged since the new token is created under the same possibility.

a) Creation of a commit record

b) Commit record of a committed possibility

Figure 17: Representation of a commit record as an object.

When the possibility is committed, this entire chain of tokens is committed at once. This does not
require any changes: the current version reference becomes the reference to the last token, and the
token reference is set to nil. Similarly, when the possibility is aborted, the entire chain of tokens is
aborted. This extension to the object model facilitates checkpointing of commit records and data
objects in general; as an extreme, commit records can be made stable throughout their lifetime. To
achieve the latter, every time a token reference is added, a new version image (token) including the
current list of tokens would have to be created for the commit record. However, special care must
be taken when a token is copied by the OVS manager. First, only the latest token (which, if
committed, will become the current version) should be subject to copying. Second, if a copy-token
operation is in progress, it should be completed before an additional token can be created.
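
The chain of tokens and its commitment in a single step can be sketched as follows. The classes and function names are hypothetical; the sketch keeps version images in memory and ignores VS addresses, copying, and commit record references.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionImage:
    data: str
    previous: Optional["VersionImage"]    # the "previous version" field


@dataclass
class Header:
    current_version: Optional[VersionImage] = None
    token: Optional[VersionImage] = None


def create_token(header, data):
    # A further token refers back to the preceding token, if there is one.
    prev = header.token if header.token is not None else header.current_version
    header.token = VersionImage(data, prev)


def commit(header):
    # The last token becomes the current version; the chain behind it comes along.
    header.current_version = header.token
    header.token = None


def abort(header):
    # Clearing the token reference abandons the entire chain.
    header.token = None
```

Committing or aborting touches only the two references in the object header, regardless of the length of the chain.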

Commit  records  represent  yet  another  problem.  Once  the  possibility  state  is  set  to  committed  or 
aborted,  it  must  not  change  in  the  future.  If  commit  records  are  represented  as  objects,  this  means 
that  it  must  not  be  possible  to  create  another  version  of  the  representing  object.  This  restriction 
must  be  enforced  by  the  commit  record  manager,  but  it  is  aided  on  the  object  level  by  the  access 
specification  field,  which  can  be  set  to  restrict  the  right  to  update  the  representing  object  to  the 
owner,  that  is,  the  repository. 

5.2 Distributed possibilities

For a distributed possibility, that is, a possibility that includes objects in more than one repository, a
primary commit record is created in one repository, and commit record representatives are created in
each other repository that contains a token for this possibility (Figure 19). When a possibility is
committed or aborted, this state is encached in the commit record representatives in all involved
nodes, and the commitment or deletion of tokens is done locally.

The  introduction  of  commit  record  representatives  complicates  the  protocol  for  continuing  a 
possibility.  To  be  able  to  rely  on  the  token  lists  in  deciding  when  to  delete  a  commit  record,  all 
representatives  with  their  lists  of  tokens  must  be  first  forced  to  stable  storage  before  a  decision  can 
be  made  whether  the  possibility  can  be  committed:  a  two-phase  commit  protocol  is  needed.  An 
alternative  solution  is  to  treat  the  token  lists  in  the  representatives  only  as  hints,  and  rely  on  the 
dual  mechanism,  that  is,  the  commit  record  references  embedded  in  the  individual  tokens.  A 
protocol  of  this  kind  is  outlined  below. 

Commit  record  representatives  can  be  implemented  in  the  following  way.  To  create  a  commit 
record  representative,  the  repository  creates  again  an  object  with  nil  as  the  commit  record  reference 
(implicitly  committed).  In  addition,  it  creates  a  token  for  this  object,  with  the  uid  of  the  primary 
commit  record  as  its  commit  record  reference.  All  local  tokens  for  this  possibility  will  refer  to  the 
object which is the local representative. When the final state of the possibility is known, the token
of the commit record representative is either committed or aborted. If nothing else is done, then
during crash recovery, it would be necessary to inquire again about the state of the primary commit
record, and primary commit records would have to be maintained (be easily accessible) forever.
Thus it is desirable to encache the state of the possibility locally in such a way that crash recovery
can  be  confined  to  the  failed  repository.  Again,  it  is  only  necessary  to  create  a  committed  version 
of  the  commit  record  representative  with  the  final  state  of  the  possibility  (committed  or  aborted) 
embedded  in  it;  the  commit  record  reference  in  this  version  is  now  nil. 

The actual protocol for distributed possibilities is summarized below:

Token accumulation phase: A repository receives a request to create a token for object x and
examines the commit record id contained in the request; this is always the id of the primary commit
record. If the respective object does not already have a committed version for the specified
pseudo-time, or another token that was created under a different possibility, the repository proceeds
to create the token. (The create-token operation will also fail if the repository finds out that the
possibility specified in the request has already been committed or aborted.) If this repository does
not contain the primary commit record, it checks whether it already has a representative for this
commit record. If not, it sends a request to the primary commit record for a permit to create a local
representative. If approved, it creates the representative. Once the local commit record
representative is located or created, the repository creates the token for object x and sets its commit
record reference to the id of the object that represents the commit record.

When  the  request  to  create  a  commit  record  representative  is  approved  by  the  primary  commit 
record,  a  reference  to  that  commit  record  representative,  or,  more  precisely,  a  reference  to  the  token 
of  the  representing  object,  is  added  to  the  list  of  tokens  of  the  primary  commit  record.  Note  that 
obtaining  an  approval  from  the  primary  commit  record  is  again  only  an  optimization. 

If  a  repository  fails  during  the  token  accumulation  phase,  the  list  of  tokens,  if  it  existed  only  in  the 
main memory, is lost. This does not mean, however, that the entire atomic action must be aborted,
since the representing object is guaranteed to survive the crash. The only complication is that the
tokens  (including  the  tokens  of  the  representatives  in  other  repositories)  will  have  to  be  converted 
individually,  as  other  atomic  actions  attempt  to  access  those  objects. 

Commit  point:  Requests  to  commit  or  abort  a  possibility  must  be  sent  to  the  primary  commit 
record. When the repository that contains the primary commit record receives such a request, it
creates a version image of the primary commit record, with the possibility state being either
committed or aborted. This version image may contain also the list of local tokens and the
references to the tokens of the representatives in other repositories.


Conversion  of  tokens:  After  the  commit  point,  the  tokens  at  the  same  repository  as  the  primary 
commit  record  are  removed  from  the  list  and  converted  into  versions  or  aborted.  A  message 
specifying the final state of the possibility is sent to each repository that contains a representative for
this commit record. Each such repository, when it receives such a message, creates a version image
of  its  local  representative;  the  possibility  state  in  this  version  image  is  set  to  the  same  value  as  the 
state  in  the  version  of  the  primary  commit  record.  The  repository  then  replies  with  a  commit-ack 
message  to  the  primary  and  starts  converting  the  local  tokens  and  removing  them  from  the  list  of 
the  local  representative. 

Deletion  of  commit  record  representative:  When  all  local  tokens  in  the  list  of  a  commit  record 
representative  are  removed,  the  commit  record  is  deleted,  and  consequently  the  representing  object 
is deleted. This approach should be followed even if the possibility has been aborted.

Deletion of the primary commit record: When the primary commit record receives a commit-ack
message from a representative, it removes the token reference for this representative from its
list. The primary commit record can be deleted when its token list is empty.
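
The bookkeeping at the primary commit record, from the commit point to its deletion, can be sketched as follows. This is a simplified illustration: the class Primary, the `send` callback and the string identifiers are assumptions, and message loss and crashes are not modelled.

```python
class Primary:
    """Token list of a primary commit record (kept only as a hint)."""

    def __init__(self, local_tokens, representatives):
        self.state = "unknown"
        self._local = set(local_tokens)
        self._reps = set(representatives)
        self.tokens = self._local | self._reps

    def complete(self, outcome, send):
        # Commit point: the final state is recorded (a version image in VS).
        self.state = outcome
        # Local tokens are converted or aborted and removed from the list.
        self.tokens -= self._local
        # Each repository holding a representative is told the final state.
        for rep in sorted(self._reps):
            send(rep, outcome)

    def on_commit_ack(self, rep):
        # A commit-ack removes the reference to that representative's token.
        self.tokens.discard(rep)

    def deletable(self):
        return self.state != "unknown" and not self.tokens
```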

Determining the state of a token during normal operation: To determine the real state of a token,
the commit record reference in the token is used to find the local commit record representative. If
the local object representing the commit record still has a token, then if the commit record
reference in this token is nil, this object represents the primary commit record and the state of the
possibility is still unknown. Otherwise, it is necessary to inquire at the primary commit record,
which is specified by the commit record reference. If the commit record has a committed version,
the state of the possibility is known locally, and is embedded in that version.
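
The decision procedure for the state of a token can be sketched as follows. The record layout is an assumption: the local object is a mapping with a "token" entry (None, or a mapping holding "commit_record_ref") and a "version" entry (None, or a mapping holding "state"); `ask_primary` stands for a remote inquiry.

```python
def possibility_state(local_record, ask_primary):
    """Return the state of the possibility for the object named by a token's
    commit record reference."""
    token = local_record["token"]
    if token is not None:
        if token["commit_record_ref"] is None:
            # The local object is the primary commit record and is still undecided.
            return "unknown"
        # The local object is a representative: inquire at the primary.
        return ask_primary(token["commit_record_ref"])
    # A committed version exists: the state is embedded in it and known locally.
    return local_record["version"]["state"]
```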

A  repository  should  maintain  a  map  from  the  primary  commit  record  ids  to  the  ids  of  the  local 
commit  record  representatives.  This  map  does  not  have  to  be  stable.  According  to  the  protocol 
above,  if  a  local  commit  record  representative  is  not  found  through  this  map,  the  repository  must 
send  a  request  to  the  primary  commit  record  to  approve  a  creation  of  a  representative.  If  the 
primary  commit  record  contains  a  reference  to  a  representative  at  that  repository,  its  id  (the  uid  of 
the  representing  object)  will  be  returned.  If  the  repository  containing  the  primary  commit  record 
failed  also  and  lost  the  token  list  but  the  atomic  action  continues,  the  requesting  repository  may 
receive  an  approval  to  create  a  new  local  record  representative.  This  means  that  a  repository  may 
have  more  than  one  local  representative  for  the  same  possibility,  but  the  mechanisms  of  the  object 
model  and  the  particular  implementation  of  the  commit  record  representatives  still  guarantee 
consistency. 

6. Recovery

At the core of the reliability measures adopted for the repository is the distinction between stable
information and hints. A hint is information that is not essential for correct functioning of a
system, but is important or even essential for good performance. In the SWALLOW repositories,
all information in main memory and in OHS is considered to be hints reconstructable from the
information in VS. The integrity of information stored in VS and OHS is assumed to be testable;
this is accomplished by associating a checksum with each page.

Since the repository relies primarily on the hints during normal operation, it is also important to be
able to check the integrity of the hints. A checksum could also be used on each page in main
memory, but since most hints (the object headers) change frequently, it is not feasible to recompute
the checksum for each such change. In most cases, however, the validity of hints can be tested
against the information in VS. For example, the current version and token reference fields of the
object header must each contain a VS address A_vi for which:

i.  A_vi is valid in the VS address space,

ii.  A_vi < A_E,

iii.  the object uid contained in the first word of the version image represented by this
VS image matches the uid in the object header.

Only the last test is necessary to ensure that the accessed entity is indeed a version of the given
object, but the first two tests can save time, since they can catch some errors without having to
access VS.
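The ordering of the three tests can be illustrated with a short sketch. The function name hint_is_valid and its parameters are assumptions made for illustration; VS is modeled as a callback that returns the uid stored in the first word of the image at a given address, and "valid in the VS address space" is simplified to a non-negative address.

```python
from typing import Callable


def hint_is_valid(addr: int, vs_end: int, header_uid: int,
                  read_uid_at: Callable[[int], int]) -> bool:
    """Check a version or token reference taken from an object header."""
    # Tests i and ii: range checks that need no access to VS.
    if addr < 0 or addr >= vs_end:
        return False
    # Test iii: the only test that actually reads VS.
    return read_uid_at(addr) == header_uid
```

A reference that fails either range check is rejected without reading VS at all, which is the saving the text describes.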

The bulk of this section concentrates on the problem of recovering objects from system crashes and
storage device decays. It is assumed that a system crash invalidates the entire content of the main
memory. The major part of a crash recovery is reconstruction of object headers, since the current
state of the recently active objects may have existed only in the main memory.

If the latest version image (the current version or a token) of an object is known, all older versions
can be found by following the chain of references embedded in the individual version images. If
this information is lost (when the current state of the object header is lost or damaged), it is
necessary to find this version image by searching VS. This is why each version image must include
the uid of the object of which it is a part. If each object is guaranteed to have at least the version
images of the current version and the token in OVS, a backward search of OVS will find the
beginning of all object histories. Otherwise the search must be extended to the online portion of
VS.

The recovery process must examine every VS image, starting from the end of VS. The issues of
how to find the end of VS and how to isolate individual VS images on a VS page are discussed in
Section 6.1. Section 6.2 presents an algorithm for reconstructing the object headers from the
information in VS. Section 6.3 describes how recovery of individual objects can be distributed over
time, triggered by an access to an object. Section 6.4 discusses the effect of a failure of a repository
on the communication protocol between the repository and the brokers.

6.1  Retrieval  of  VS  images 

Before recovery of object headers can begin, it is necessary to find the current end of the version
storage, that is, the address of the latest page written into OVS; the mark M_E can be viewed as
pointing to the end of this page. This address could be found by searching from the low end of
OVS or from the copy mark, in the direction of increasing VS addresses. In some of the OVS
management schemes, these other marks are implicit, and thus no additional precautions must be
taken. To remember the end of VS reliably, the mark M_E would have to be kept in stable storage.
Otherwise, M_E can be found by searching for the first "free page." On an optical disk, this means
the beginning of the area that has not yet been written. On a magnetic disk, each page, as released
by the OVS demon, could be marked as "free." Such information provides a useful check in
general: before the version buffer in main memory is written into VS, the specified OVS page
should be checked to confirm that it is free.
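A minimal sketch of the search for the first free page, assuming OVS is modeled as a Python list in which None stands for a free (unwritten or released) page. The function name find_end_of_vs and this representation are illustrative only.

```python
from typing import List, Optional


def find_end_of_vs(pages: List[Optional[bytes]], start: int) -> int:
    """Return the index of the first free page at or after `start`.

    `start` plays the role of the low end of OVS or the copy mark; the
    search runs in the direction of increasing VS addresses.
    """
    for i in range(start, len(pages)):
        if pages[i] is None:  # None models a page marked "free"
            return i
    return len(pages)  # every page from `start` on has been written
```

The same free-page test can serve as the pre-write check mentioned above: a write to page i should proceed only if pages[i] is None.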

The failure might have occurred between the two physical writes in the duplicated implementation
of stable VS. If the latest page as written to one of the devices is found correct, the VS write can
be completed, that is, that page is written also to the other device; otherwise that page should be
marked as bad, and the end of VS set to the end of the preceding page (the latest page on the other
device). No external request (that is, a request from a broker or another repository) for which some
information has to be written into VS is acknowledged until both writes complete; thus if a (dual)
VS page is declared bad because the second write did not complete correctly, no harm is done.
However, if the write is completed during recovery, the create-token requests that caused creation of
VS images on that page cannot be acknowledged, since the repository lost all information about
these requests. The original requestors may retry their requests, in which case the recovered
repository will send back an acknowledgement, as discussed in Section 6.4. Otherwise, the
individual tokens on that page eventually will be aborted because of a timeout. Copies of version
images made by the OVS manager will be found and incorporated into the chains representing the
object histories by the main recovery process.
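The handling of an interrupted dual write can be sketched as below. This is a simplified model, not the repository's actual mechanism: each device is a list of pages, a page is a (payload, checksum) pair checked with CRC-32, the first device is assumed to be at most one page ahead of the second, and the names make_page, is_correct and reconcile are hypothetical.

```python
import zlib
from typing import List, Set, Tuple

Page = Tuple[bytes, int]  # (payload, stored checksum)


def make_page(payload: bytes) -> Page:
    return (payload, zlib.crc32(payload))


def is_correct(page: Page) -> bool:
    payload, stored = page
    return zlib.crc32(payload) == stored


def reconcile(first: List[Page], second: List[Page], bad: Set[int]) -> int:
    """Return the end of VS (number of valid pages) after a crash.

    Assumes `first` holds either the same pages as `second` or exactly
    one more page (the write that was in progress).
    """
    if len(first) == len(second):
        return len(first)              # both writes had completed
    idx = len(first) - 1
    if is_correct(first[idx]):
        second.append(first[idx])      # complete the interrupted write
        return len(first)
    bad.add(idx)                       # mark the torn page as bad
    return idx                         # end of VS is the preceding page
```

Because no request is acknowledged before both writes complete, discarding the torn page in the last branch loses nothing that a requestor was told had succeeded.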

The next problem is to isolate the individual VS images. The scan of VS should proceed from its
high end towards the low end; for individual pages, this means from the end of a page towards its
beginning. This means that the size field should be at the higher-address end of a VS image. Since for
normal use, the position of the size field must be computable from the VS address contained in a
version or token reference, this implies that VS images should be stored so that their first word, that
is, the word specified by those references, has the highest VS address. Finally, if a page is not
completely filled when written into VS, a dummy data image should be created in the unused space.
This dummy data image will be discovered only during recovery, but it will be automatically
ignored since all data images are ignored during recovery: only the size field of the representing
VS image is used to get to the beginning of the preceding VS image.
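The backward scan of one page can be sketched as follows, under a deliberately simplified layout that is an assumption of this example: a page is a list of integer words, and the highest-address word of every image is its size field, counting the total number of words in the image including the size field itself. The function name scan_page_backward is hypothetical.

```python
from typing import List


def scan_page_backward(page: List[int]) -> List[List[int]]:
    """Isolate the images on a page, from its end towards its beginning.

    Returns the body of each image (without its size word), latest first.
    """
    images = []
    pos = len(page) - 1                # the last word is a size field
    while pos >= 0:
        size = page[pos]
        if size < 1 or size > pos + 1:
            raise ValueError("corrupt size field at word %d" % pos)
        start = pos - size + 1
        images.append(page[start:pos])  # words below the size field
        pos = start - 1                 # size field of the preceding image
    return images
```

The scan works only because the page is completely tiled by images, which is why an unfilled page needs a dummy data image in the unused space.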

6.2  Reconstruction  of  object  headers 

Since most repository crashes will not damage OHS, the recovery process can use the OHS image as the
starting point. As stated earlier, it is assumed that a checksum is associated with each OHS page, and that it is
sufficient to test the integrity of the object headers on the page. The value of the field that specifies the end
validity time of the current version in the object header in OHS provides a logical delimitation for
recovery: only if some version (token) was created after this time (this would mean that the OHS
image was not updated) must the hint in the object header be updated (reconstructed) from the
information in VS. Unfortunately, because of the copying of version images in VS, there is no
simple unique mapping from time to a physical location in VS. Thus only the current version
reference A_cv and the token reference A_t in the surviving object header are useful: VS must be
searched only as far as the higher of these two addresses.

If the object header in OHS is damaged, VS must be searched until all of the following are found:

1)  a  version  image  of  the  current  version 

2)  a  version  image  of  the  token  (if  any) 

3)  an  image  of  the  object  header. 

If the OHS image is not damaged but is merely obsolete, it is only necessary to find the first two
items. If the found image of the object header precedes the version images of both the current
version and the token (i.e., it is the latest entity in VS pertaining to this object), the object header is
recovered without any further search. If a version image of a token is found first, it is not necessary
to search for a version image of the current version, since a reference to it is contained in the token.
However, if a version image of the committed version is found first, it could be a copy, and thus it
is still necessary to search for a token. Moreover, this version image does not necessarily represent
the current version! This can happen if the current version had been copied while the object
had a token, such as in Figure 8b, after which the token was committed but not copied.
Fortunately, these two anomalies are mutually exclusive. Thus, if the first version image vi_1 (latest
in VS) found for a particular object represents a committed version, it is necessary to continue the
search until a version image vi_2 that represents a different version or a valid (not aborted) token is
found. If an image of the object header is found next after vi_1, then vi_2 is the version image
pointed to by the token reference if it is not nil, and by the current version reference otherwise.
Now vi_1 represents the current version if:

1)  vi_2 represents a token, or

2)  ts(vi_2) < ts(vi_1), where ts is the start validity time of that version.

If ts(vi_2) > ts(vi_1), then vi_2 is the current version and the object does not have a token.
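The rule for choosing between the two images can be written as a small function. The sketch assumes the first image found is a committed version and that the second image has already been classified as a token or a committed version; the names VersionImage, is_token and resolve are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VersionImage:
    ts: int          # start validity time of the version
    is_token: bool   # True if the image represents a valid (not aborted) token


def resolve(vi_1: VersionImage,
            vi_2: VersionImage) -> Tuple[VersionImage, Optional[VersionImage]]:
    """Return (current version, token or None).

    vi_1 is the first committed version image found (latest in VS);
    vi_2 is the next image found that represents a different version
    or a valid token.
    """
    if vi_2.is_token:
        return vi_1, vi_2       # condition 1: vi_1 is current, vi_2 its token
    if vi_2.ts < vi_1.ts:
        return vi_1, None       # condition 2: vi_2 is an older version
    return vi_2, None           # vi_1 was a stale copy; vi_2 is current
```

The last branch is the anomaly described above: a copy of the old current version was written to VS after the token that later became the current version.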

A token representation is indistinguishable from a version representation. If there exists a reference
to version image X in another version image, X must be a committed version. But if a version
image is retrieved without such context, to distinguish between a committed version and a token, it
is necessary to check the commit record, or, more specifically, the local commit record
representative. This is why a version image of a token (and consequently, a version) must contain
the uid of its commit record. Also, when an image of an object header is found, it may have been
written into VS as part of an operation that has not yet been committed. Recall that a VS image of
the object header is made when the object's status is changed: the object is created or deleted, or
its access specification is changed. Again, it is necessary to use the commit record reference in the
object header image to determine the state of the possibility under which the status of the object
was to be changed. Thus an important part of reconstructing the object headers is finding the
appropriate commit records.

Since commit records are represented by objects, they must first be recovered by the same
mechanism as objects representing clients' data. However, at the time of a crash, a large portion of
the commit records that will have to be inspected during recovery have probably been deleted.
This means that their object headers were written into VS, marked as deleted. The repository does
not have to recover deleted objects (given that the deletion was committed), but it must temporarily
recover deleted commit records, so that other objects can be recovered. Since VS images of object
headers are easily distinguishable (their commit record reference is nil), the handling of deleted
commit records does not represent a major problem.

The copying of version images by the OVS manager also complicates the reconstruction of the
relevant commit records. Without copying, the commit record of a possibility that reached the
final state would be guaranteed to be recovered prior to all version images and object header images
created under that possibility. When a committed version image is copied, it gets "ahead" of its
commit record, that is, the recovery process will find that version image before it recovers the
commit record. This can happen even if the copied version image is still a token: if the copying of
the token occurs just before the state of the possibility is finalized, the copy of the token and the
version image of the commit record may end up in different VS buffers, and be written into VS in
the reverse order. The images of object headers are always ordered correctly in VS, since they are
read from VS only during recovery and therefore are not copied by the OVS manager.

The search process sketched in the beginning of this section must be expanded to take into account
the problem of recovering the commit records. It is assumed that only the final state of a possibility
is recorded in stable storage. Also, if the recovery process does not find a version representing the
final state of a given possibility, it cannot abort the possibility, since the reconstructed local object
might be just a representative of the commit record.

Again, the exact recovery of individual objects depends on the order in which the various relevant
entities are found:

►  The first entity found is an image of the object header:

Since the VS images of object headers are not copied in OVS, if the changes to the object
status as reflected by this object header image were finalized (committed or aborted), the
appropriate commit record version must have been already found by the recovery process. If it
has not been found, the possibility is still in unknown state. In any case, the current version
reference and token reference in this object header image can be used to rebuild the object
header in OHS. If the found object header image is not committed, the version reference in this
image can be used to find the preceding VS image of the object header which contains the
correct stable information for this object.

►  The first entity found is a version image; call it again vi_1:

1.  The commit record for this version image has already been reconstructed. This can happen
only if:

a.  vi_1 is a committed version that has not been copied; since this is the first image found,
the object does not have a token.

b.  vi_1 is an aborted token. Embedded in this token is a reference to the current version;
neither the current version nor the commit record for the current version have to be
searched.

2.  The commit record has not been reconstructed yet. This can happen if:

a.  The final version of the commit record has not been created yet, thus vi_1 represents a
token.

b.  vi_1 represents a committed version that was copied by the OVS manager.

c.  vi_1 represents an aborted token that got ahead of the final version of the commit record
due to the nonsequential management of the VS buffers.

To resolve this uncertainty, it is necessary to continue the search of VS until one of the
following is found:

i.  A version image of the commit record:

-  If the embedded possibility state is unknown, vi_1 is a token, and it contains a
reference to the current version.

-  If the embedded possibility state is committed, vi_1 is a copy of the current
version; it is still necessary to search for the possible token.

-  If the embedded possibility state is aborted, vi_1 is a copy of an aborted token, vi_1
contains a reference to the current version, and the object does not have a token.

ii.  Another version image, vi_2, created under a different possibility than vi_1 (this
restriction is sufficient to handle correctly situations where vi_2 is just another copy of the
same version image, and also the cases when multiple tokens were created under the same
possibility):

-  If ts(vi_2) < ts(vi_1), then vi_1 must be a token. Embedded in vi_1 is a reference to
the current version; this is not necessarily vi_2, since vi_2 could be an aborted token
or a no longer accessible copy of an earlier version.

-  If ts(vi_2) > ts(vi_1), then vi_1 must be a copy. If vi_2 is a token or an aborted
token, then vi_1 represents the current version, otherwise it is a copy of the
preceding version. Thus it is necessary to continue the search of VS until the
commit record for vi_2 is reconstructed.

Finally, the object headers contain the end time of the current version and the token; this
information also must be reconstructed somehow. If an object has a token, the end time of the
current version must be one "tick" less than the creation time of the token. The end time of the
token, and if an object does not have a token, then the end time of the current version, ought to be
set to the current time, that is, the time when the object is recovered.
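Two pieces of this case analysis are compact enough to sketch: the interpretation of the first version image once its commit record has been found (case 2, item i), and the reconstruction of the end times. The names State, classify_first_image and end_times are hypothetical, times are modeled as integers with a tick of 1, and the returned role strings are labels chosen for this example.

```python
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    UNKNOWN = "unknown"
    COMMITTED = "committed"
    ABORTED = "aborted"


def classify_first_image(state: State) -> Tuple[str, bool]:
    """Interpret the first image from the state embedded in its commit record.

    Returns (role of the image, whether the search for a token must continue).
    """
    if state is State.UNKNOWN:
        return ("token", False)
    if state is State.COMMITTED:
        return ("copy of current version", True)
    return ("copy of aborted token", False)


def end_times(now: int,
              token_created_at: Optional[int] = None) -> Tuple[int, Optional[int]]:
    """Return (end time of current version, end time of token or None)."""
    if token_created_at is not None:
        # The current version ends one tick before the token was created.
        return token_created_at - 1, now
    return now, None
```

In the unknown and aborted cases the image itself carries the reference to the current version, which is why only the committed case has to keep searching.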


6.3  Real-time  recovery 


The actual recovery process should be as efficient as possible so that the delays experienced by the
clients will not be noticeable. The repository can limit the extent of crash recovery through special
checkpoints. In addition, rather than recovering all objects in the repository before resuming
normal processing, recovery can be distributed over time. In particular, individual objects can be
recovered as they are accessed.


For this, it is necessary to be able to distinguish the epochs between different recoveries. Thus the
repository should maintain, as part of its state, the current recovery epoch number, REN. Every
time the recovery process is started, the repository is assigned a new REN such that these numbers
monotonically increase in time. REN must be included also in each object header. When an object
is created, it is assigned the current REN. When an object is accessed through any of the
operations listed in Section 4, then if its OHS image is not damaged, the REN in the object header
is compared to the current REN of the repository. If they differ, the object header must be
updated to reflect the changes since the time the object header was written into VS during the
recovery epoch as given by its REN. If the object is locked, the lock is simply broken: the locks
must be honored only if the object REN and the current repository REN are the same. If an object
is not used for a long time, several crashes (and recoveries) could have occurred since the object was
created or recovered. However, since such an object has not been recovered earlier, it could not
have been used (read or written) since the recovery epoch given by its REN, and thus to recover
such an object, it is not necessary to search VS from its current end, but only from the point that
corresponds to the end of that epoch.
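The epoch check performed on each access can be sketched as below. Header, access and the recover callback are hypothetical names; the callback stands in for the bounded search of VS that starts from the point corresponding to the end of the given epoch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Header:
    ren: int               # recovery epoch number stored in the object header
    locked: bool = False


def access(header: Header, repo_ren: int,
           recover: Callable[[Header, int], None]) -> Header:
    """Bring an object header up to date before it is used."""
    if header.ren != repo_ren:
        # The object has not been touched since epoch header.ren, so the
        # search of VS can start from the end of that epoch.
        recover(header, header.ren)
        header.locked = False   # locks from an earlier epoch are broken
        header.ren = repo_ren
    return header
```

An object whose REN already matches the repository's is left untouched, including its lock.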

Thus, the recovery process should, at the commencement of a recovery, write a mark into VS that
specifies the beginning of a new recovery epoch. For quick location of these marks, they should be
chained together as are the histories of individual objects. Thus the recovery mark can be
represented by an object: if the object header in OHS survives the crash, the last version is easy to
find, and the new version of the mark can be created with the reference to the last one immediately.
If the object header of the recovery mark is destroyed, it is necessary first to search VS for the last
version of the recovery mark. The object header of the recovery mark is modified only
during recovery, and it should be forced immediately into OHS. This guarantees that the correct
information is always in OHS and thus should survive most crashes.

When an object is recovered, its REN in the reconstructed object header is set to the current REN.
Also, a VS image of the object header should be created: this will delimit the extent of the next
recovery should the OHS image be damaged. In such a case, the recovery must start from the
current end of VS.

In the process of reconstructing an object, it is again necessary to reconstruct the appropriate
commit record(s). Since atomic actions survive repository crashes, the fact that the final version of a
commit record is not found in the same recovery epoch as the object in question does not mean
that the state of the possibility has not been resolved. Still, since a commit record is also an object,
an attempt to access it will automatically direct the recovery process into the right recovery epoch.

6.4  Communication  with  brokers 

A failure of a repository can also affect the brokers. It is the responsibility of the brokers to
supervise that requested operations are indeed performed by the repositories. If a broker does not
receive a reply from a repository, then unless the requested operation is not important for correct
completion of the given atomic action, the broker has two options:

i.  abort  the  entire  atomic  action,  or 

ii.  repeat  the  request. 
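Option ii can be sketched as a retry loop on the broker side. It is safe only because repository operations are idempotent for a given pseudo-time; the toy Repository class below models that property by keying tokens on (uid, pseudo-time). All names here (Repository, create_token, request_with_retry) are illustrative and are not taken from the SWALLOW Message Protocol.

```python
from typing import Callable, Dict, Optional, Tuple


class Repository:
    """Toy repository whose create-token operation is idempotent."""

    def __init__(self) -> None:
        self.tokens: Dict[Tuple[str, int], bytes] = {}

    def create_token(self, uid: str, pseudo_time: int, data: bytes) -> str:
        # A repeated request with the same pseudo-time changes nothing.
        self.tokens.setdefault((uid, pseudo_time), data)
        return "ack"


def request_with_retry(send: Callable[[], Optional[str]],
                       max_attempts: int = 3) -> Optional[str]:
    """Repeat a request until a reply arrives or attempts run out.

    `send` returns None when no reply was received. A None result from
    this function means the broker falls back to option i and aborts.
    """
    for _ in range(max_attempts):
        reply = send()
        if reply is not None:
            return reply
    return None
```

If the first request was processed but its reply was lost, the repeated request finds the token already present and simply produces the acknowledgement again.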

Now, of course, it is possible that the first request was received and processed by the repository, but
since all operations supported by the repositories are idempotent (if they carry the same
pseudo-time), duplicate requests do not represent any problem. The only complication arises if a message
from a broker containing data for a token is delivered in pieces. Unless the entire structured
version image was already created, if the request is repeated, the previous incomplete message must
be discarded, since the partitioning of the repeated message may be different from the previous one.

7.  Summary


Figure 20 summarizes the structure of a SWALLOW repository as a lattice of abstractions. A more
detailed description of the structure is given in the appendix. The entire design of the repository is
centered around the Version Storage, which is the only stable storage in the repository. In a sense,
VS is similar to the transaction log of database management systems [GRAY 79]. However, there is
an important difference: VS is used not just for recovery, but it is where the actual data are.

VS contains not only the versions of objects, but also the commit records and images of the object
headers. However, the name Version Storage has been retained, since:

i.  commit records are represented by ordinary objects (and thus VS contains their
versions), and

ii.  the object header images are in fact selected versions of the state of individual
objects.

VS is append-only storage, in accordance with the basic object model. It provides a linear paged
address space with a straightforward mapping from the VS address into a location on the physical
device. VS is duplicated for stability, but since no update in place is possible, the two required
writes can be concurrent.

Since VS may grow very large, it is impossible to maintain the entire VS online. Only the upper 2^n
words of VS are kept in the Online Version Storage. OVS would thus contain the current versions
and tokens of the recently updated objects. To make sure that the current versions of most objects
are found in OVS, it is necessary to copy occasionally the images of current versions and tokens to
the high end of VS. The most reasonable policy for managing OVS seems to be to copy a version
image when the repository is processing a read request involving a current version or a token and
the representing VS image is found to have a lower VS address than the copy mark. This policy
preserves locality of reference, and automatically brings back online the current versions of the
objects that have not been used for a long time.
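The copy-on-read policy can be sketched in a few lines, assuming VS is modeled as an append-only Python list indexed by VS address. The function name read_current and this representation are assumptions of the example.

```python
from typing import List, Tuple


def read_current(vs: List[str], addr: int, copy_mark: int) -> Tuple[str, int]:
    """Read a current version or token image, copying it forward if needed.

    Returns the image and the address at which it should now be
    referenced from the object header.
    """
    image = vs[addr]
    if addr < copy_mark:
        vs.append(image)        # copy the image to the high end of VS
        addr = len(vs) - 1      # the header should now refer to the copy
    return image, addr
```

Images that are read often stay above the copy mark and are never copied again, while an image that drifted below the mark is brought back to the high end on its next read.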

OVS can be implemented with a reusable device, or with write-once devices. The latter form
simplifies the transfer of version images from online to offline storage. The delays due to manual
device replacement can be eliminated through a circular assignment of device drivers to different
functions in the implementation of OVS.

The crash recovery of the repositories is based entirely on the information contained in VS.
Current contents of object headers, although the object headers are the key elements in all
operations on objects, are treated as hints that are fully reconstructable from the information found
in VS. Since the commit records are implemented as objects, they are reconstructable by the same
process. Finally, the object directory is an object itself and hence reconstructable from the
information in VS.

Figure 20: Structure of the repository.

This report presented only a skeleton for the design of the SWALLOW repositories. Many issues
were touched on only very lightly, and some important issues have not been addressed at all. In
particular, performance of OVS under the proposed copying policy needs to be evaluated and the
sketched algorithm for reconstruction of the object headers ought to be analyzed more formally for
possible inconsistencies. Some of the additional issues are:

i.  Virtual memory. It has been assumed that both VS and OHS are divided into pages,
and that pages from both are brought into main memory on demand. So far, OHS and VS
have been treated as distinct address spaces. This means that to implement virtual memory
their pages would have to be mapped into main memory in different ways. Alternatively,
OHS and VS can be made part of the same address space, e.g., OHS can be the lowest 2^k
words of that space.

ii.  Communication with brokers and other repositories. Objects can be sent to
repositories in pieces, subject to the constraints imposed by the communication substrate
and communication buffer capacity of the receiver. Although the repository can deal with
pieces of any size (if they are too big, they will be broken up further before being stored as
data images), better performance can be achieved if the communication substrate already
delivers pieces of the right size; the optimal size is the size of a page minus the amount of
storage needed for the size field and the type tag which are added when a data image is
created.

iii.  Protection. It is assumed that object versions in the repository will be stored in an
encrypted form, where encryption provides the only kind of protection for read accesses
[REED 80]. Some protection against modification is provided by the immutability of object
versions, but it should be possible to control the ability to create and delete objects, create
tokens and change the state (commit or abort) of commit records. Objects and commit
records in the repository were designed to include an access control specification field which
is stable; however, it is not clear what should be in this field and how the rights of the
requestors should be checked. An interesting question is what the right to read means in
the context of the given object model. In particular, does a revocation of such right apply
only to the future versions of the object, or also to the current and the past versions?

iv.  The  repository  provides  mechanisms  that  facilitate  building  of  atomic  actions; 
however,  it  is  the  responsibility  of  the  users  of  SWAl  I.OW  to  make  sure  that  these 
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mechanisms  arc  used  properly.  The  division  of  responsibility  for  correct  implementation  of 
atomic  actions  should  be  studied  in  more  depth.  SWALLOW  could  assist  in  enforcing 
correct  use  by  supervising  that: 

a.  a  possibility  cannot  be  committed  until  all  outstanding  requests  to  create  a 
token  have  been  received  and  processed 

b.  once  a  possibility  is  committed  or  aborted,  no  new  tokens  can  be  added. 
However,  distributed  possibilities  make  such  checking  difficult 

Appendix


STRUCTURE OF THE REPOSITORY


This appendix describes in more detail the individual modules of the repository and their logical
interconnection (the "uses" hierarchy presented in Figure 20). Note that some modules support
more than one abstraction developed in this report. External operations are the operations provided
at the module's interface, that is, operations that can be invoked from other modules. Internal
operations are available only within the module. Recovery operations are special external operations
that are invoked only by the recovery process.


Request  handler 

implements:  repository  interface 

uses:        object
             commit record
             SWALLOW Message Protocol

The request handler inspects messages delivered by the SWALLOW Message Protocol [REED 80]
and invokes the appropriate manager to handle the request; it then constructs reply messages from
the information returned by the manager.
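
A minimal sketch of this dispatch step follows. The message layout (a dictionary with "id", "manager", "operation" and "args" fields) and the stand-in object operations are assumptions made for illustration; they are not the actual format of the SWALLOW Message Protocol.

```python
def make_request_handler(managers):
    """Build a handler from a table: manager name -> {operation name -> callable}."""

    def handle(message):
        # Inspect the message and pick the manager operation it asks for.
        operations = managers.get(message["manager"], {})
        operation = operations.get(message["operation"])
        if operation is None:
            return {"id": message["id"], "status": "error",
                    "result": "unknown request"}
        # Invoke the manager and build the reply from what it returns.
        result = operation(*message.get("args", ()))
        return {"id": message["id"], "status": "ok", "result": result}

    return handle


# Stand-in object manager operations backed by a plain dictionary.
store = {}

def create_object(uid, value):
    store[uid] = value
    return uid

def read_object(uid):
    return store[uid]

handle = make_request_handler(
    {"object": {"create": create_object, "read": read_object}})
reply = handle({"id": 1, "manager": "object",
                "operation": "create", "args": ("u1", "first version")})
```

Keeping the operations in an explicit table means that a request can only reach operations a manager chooses to export, which mirrors the distinction between external and internal operations made in this appendix.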


Commit  record  manager 

implements:  commit record
             commit record representative

uses:        object

external operations:        use:

create                -->   create object
                            create token

add reference         -->   primitives of the implementation language

commit                -->   create token
                            commit token
                            delete reference
                            delete object

abort                 -->   create token
                            commit token
                            abort token
                            delete reference
                            delete object

internal operations:        use:

delete reference      -->   primitives of the implementation language

delete                -->   delete object

recovery operations:

none (recovered only as objects)


Object  manager 


implements:  object

uses:        directory
             object history
             uid

external operations:        use:

create                -->   get new uid
                            create object history
                            enter into directory

read                  -->   lookup directory
                            read object history

create token          -->   lookup directory
                            create token

commit token          -->   lookup directory
                            commit token

abort token           -->   lookup directory
                            abort token

set access control    -->   lookup directory
                            set access control on object history

delete                -->   lookup directory
                            delete object history
                            delete from directory

recovery operations:

none
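
The way the object manager composes the uid, directory and object history abstractions can be sketched as follows. This is a simplified in-memory model under stated assumptions: the directory is a plain dictionary, an object history is a list of committed versions plus at most one token, and access control is omitted. None of these representations is taken from the report.

```python
import itertools


class ObjectManager:
    """Hypothetical in-memory model of the object manager's external operations."""

    def __init__(self):
        self._uids = itertools.count(1)   # stand-in for the uid manager
        self._directory = {}              # stand-in for the directory: uid -> history

    def create(self):
        uid = next(self._uids)                      # get new uid
        history = {"versions": [], "token": None}   # create object history
        self._directory[uid] = history              # enter into directory
        return uid

    def read(self, uid):
        history = self._directory[uid]              # lookup directory
        versions = history["versions"]              # read object history
        return versions[-1] if versions else None

    def create_token(self, uid, value):
        self._directory[uid]["token"] = value       # lookup directory, create token

    def commit_token(self, uid):
        history = self._directory[uid]              # lookup directory
        if history["token"] is None:
            raise RuntimeError("no token to commit")
        history["versions"].append(history["token"])  # token becomes current version
        history["token"] = None

    def abort_token(self, uid):
        self._directory[uid]["token"] = None        # lookup directory, abort token

    def delete(self, uid):
        history = self._directory[uid]              # lookup directory
        history["versions"].clear()                 # delete object history
        del self._directory[uid]                    # delete from directory
```

Every operation other than create starts with a directory lookup, matching the "use" column above.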


UID manager

implements:  uid

uses:        object history

external operations:        use:

new                   -->   may have to create new version

recovery operations:        use:

reset uid             -->   reconstruct object history

Directory  manager 

implements:  directory

uses:        object history

external operations:        use:

create                -->   create object history

enter                 -->   primitives of the implementation language

lookup                -->   primitives of the implementation language

recovery operations:        use:

recover               -->   reconstruct object history

Object  history  manager 

implements:  object history

uses:        version image
             OHS image

external operations:        use:

create                -->   create object header
                            create version image (of object header)
                            create OHS image

read                  -->   read object header
                            read version image (returns also Ac)
                            copy current version
                            copy token

create token          -->   read object header
                            create version image

commit token          -->   read object header

abort token           -->   read object header

set access control    -->   read object header
                            create version image (of object header)

delete                -->   read object header
                            create version image (of object header)
                            delete OHS image
internal operations:        use:

create object header  -->   primitives of the implementation language

read object header    -->   read OHS image

write object header   -->   write OHS image

copy current version  -->   copy version image

copy token            -->   copy version image

recovery operations:        use:

reconstruct           -->   read object header
                            search version image
                            create version image (of object header)
                            write object header


Version image manager

implements:  simple version image
             structured version image
             VS image of object header

uses:        VS image

external operations:        use:

create version image  -->   create VS image

read version image    -->   read VS image (returns also Ac)

copy version image    -->   create VS image

recovery operations:        use:

search                -->   next VS image


VS image manager

implements:  VS image

uses:        VS

external operations:        use:

read                  -->   read VS page (returns also Ac)

create                -->   append VS

recovery operations:        use:

next (iterator)       -->   next VS page


VS manager

implements:  VS

uses:        main memory page
             storage device

external operations:        use:

append                -->   append VS buffer

read page             -->   read storage device page (OVS or offline VS)
                            get Ac (returned together with the requested page)

internal operations:        use:

append VS buffer      -->   allocate main memory page
                            write storage device page

reset                 -->   primitives of the implementation language

get Ac                -->   primitives of the implementation language

assign device drivers -->   primitives of the implementation language

recovery operations:        use:

find end              -->   read storage device page (recover Mp)

next page (iterator)  -->   read storage device page


OHS image manager

implements:  OHS image

uses:        storage device

external operations:        use:

create                -->   write storage device page

read                  -->   read storage device page

write                 -->   write storage device page

delete                -->   write storage device page

recovery operations:

none


Crash  recovery 

uses:        uid
             object history
             VS

external operations:        use:

start recovery        -->   find end of VS
                            create recovery mark

internal operations:        use:

create recovery mark  -->   get new uid (new RKN)
                            create token (for the recovery mark object)
                            commit token
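
The "uses" entries listed in this appendix can be collected into a small table and checked mechanically. The sketch below is one possible encoding, read off the "uses:" lines of the modules above, with each module named by the abstraction it implements; it is not a reproduction of Figure 20. The level numbering (0 for abstractions that use nothing listed here) is an assumption introduced for illustration.

```python
# "uses" relation transcribed from the appendix.
USES = {
    "request handler": ["object", "commit record", "SWALLOW Message Protocol"],
    "commit record": ["object"],
    "object": ["directory", "object history", "uid"],
    "uid": ["object history"],
    "directory": ["object history"],
    "object history": ["version image", "OHS image"],
    "version image": ["VS image"],
    "VS image": ["VS"],
    "VS": ["main memory page", "storage device"],
    "OHS image": ["storage device"],
    "crash recovery": ["uid", "object history", "VS"],
}


def levels(uses):
    """Assign each abstraction a level one above the highest thing it uses."""
    result = {}
    visiting = set()

    def level(name):
        if name in result:
            return result[name]
        if name in visiting:
            raise ValueError("cycle in the uses relation at " + name)
        visiting.add(name)
        # Abstractions that use nothing get level 0 (1 + default of -1).
        result[name] = 1 + max((level(u) for u in uses.get(name, ())), default=-1)
        visiting.discard(name)
        return result[name]

    for name in uses:
        level(name)
    return result


LEVELS = levels(USES)
```

Because the computation raises an error on a cycle, a successful run confirms that the relation as transcribed here forms a hierarchy, with the storage device and main memory page at the bottom and the request handler at the top.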


OFFICIAL  DISTRIBUTION  LIST 


Defense Technical Information Center
Cameron Station
Alexandria, VA 22314
12 copies

Office of Naval Research
Information Systems Program
Code 437
Arlington, VA 22217
2 copies

Office of Naval Research
Branch Office/Boston
Building 114, Section D
666 Summer Street
Boston, MA 02210
1 copy

Office of Naval Research
Branch Office/Chicago
536 South Clark Street
Chicago, IL 60605
1 copy

Office of Naval Research
Branch Office/Pasadena
1030 East Green Street
Pasadena, CA 91106
1 copy

New York Area
715 Broadway - 5th floor
New York, N. Y. 10003
1 copy

Naval Research Laboratory
Technical Information Division
Code 2627
Washington, D. C. 20375
6 copies

Assistant Chief for Technology
Office of Naval Research
Code 200
Arlington, VA 22217
1 copy

Office  of  Naval  Research 
Code  455 

Arlington,  VA  22217 
1  copy 

Dr.  A.  L.  Slafkosky 
Scientific  Advisor 
Commandant  of  the  Marine  Corps 
(Code  RD-1) 

Washington,  D.  C.  20380 
1  copy 

Office  of  Naval  Research 
Code  458 

Arlington,  VA  22217 
1  copy 

Naval  Ocean  Systems  Center,  Code  91 
Headquarters-Computer  Sciences  & 
Simulation  Department 
San  Diego,  CA  92152 
Mr.  Lloyd  Z.  Maudlin 
1  copy 

Mr.  E.  H.  Gleissner 

Naval  Ship  Research  &  Development  Center 
Computation  &  Math  Department 
Bethesda,  MD  20084 
1  copy 

Captain  Grace  M.  Hopper,  USNR 
NAVDAC-OOH 

Department  of  the  Navy 
Washington, D. C. 20374
1  copy 


